Scrape Columbia CS Courses and Evaluation
Tools
- Scrapy: for web scraping
- Ghostscript: for PDF text extraction
Get all course information
Columbia uses a website called Vergil to help students plan their courses, so all course information can be found there. First, check the structure of the web page.
When searching for COMS (CS department courses), the URL is https://vergil.registrar.columbia.edu/#/courses/COMS .
In the Network panel of Chrome Developer Tools, there is a JSON response that contains all the results of the query. Let's check the URL of this request: https://vergil.registrar.columbia.edu/doc-adv-queries.php?key=COMS&moreresults=2 . It seems key and moreresults are the two query parameters, and the endpoint returns a JSON file containing everything we need, so we can write a spider to do the rest.
For convenience, I only fetch two categories, EECS and COMS. The JSON file has the class information as well as the instructors. Since I want to get the professors' evaluation results from the university evaluation system, I also extract each professor's UNI (a unique identification code at Columbia).
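A minimal Scrapy sketch of this step, using the doc-adv-queries.php endpoint found above. The field names inside the JSON ("courses", "title", "number", "instructorUNI") are my assumptions and should be adjusted to whatever the real response contains.

```python
import json
import scrapy


class CourseSpider(scrapy.Spider):
    name = "courses"
    # Only the two categories I care about.
    start_urls = [
        "https://vergil.registrar.columbia.edu/doc-adv-queries.php?key=COMS&moreresults=2",
        "https://vergil.registrar.columbia.edu/doc-adv-queries.php?key=EECS&moreresults=2",
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # Hypothetical field names -- inspect the actual JSON to find
        # the course list and the instructor UNI.
        for course in data.get("courses", []):
            yield {
                "title": course.get("title"),
                "number": course.get("number"),
                "instructor_uni": course.get("instructorUNI"),
            }
```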
The evaluation page looks like this:
Each link points to a PDF file of evaluation results. It would be easier for us if there were a JSON file containing all the links. Let's have a look.
Great, there actually is one, and it is named after the professor's UNI. The link is https://vergil.registrar.columbia.edu/cms/cw/eval/eg2173 . The path key in the JSON is what we need: the link to the PDF evaluation result. We can use a new crawler to do all the work!
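Here is a rough sketch of that crawler. The endpoint and the path key come from the inspection above; the UNI list and the exact shape of the response (a JSON list of entries, each carrying a path) are assumptions.

```python
import json
import scrapy


class EvalLinkSpider(scrapy.Spider):
    name = "eval_links"
    # UNIs collected by the course spider; "eg2173" is the example above.
    unis = ["eg2173"]

    def start_requests(self):
        for uni in self.unis:
            url = f"https://vergil.registrar.columbia.edu/cms/cw/eval/{uni}"
            yield scrapy.Request(url, callback=self.parse, cb_kwargs={"uni": uni})

    def parse(self, response, uni):
        data = json.loads(response.text)
        # Assumed structure: iterate over the entries and keep the PDF link.
        for entry in data:
            yield {"uni": uni, "pdf_url": entry.get("path")}
```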
In order to see the evaluation results, we need to pass CAS (Central Authentication Service) authentication. It is pretty easy: the login URL is https://cas.columbia.edu/cas/login?service=ias-qmss&destination=https%3A%2F%2Fvergil.registrar.columbia.edu%2Fcms%2Fuser%2Fwind . Log in first, scrape after!
Finally, we have the links to the evaluation results for all professors and all courses. All that is left is to download them. We also need to get through CAS first; a new crawler can handle both, as sketched below.
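A sketch of that downloader, assuming Scrapy keeps the session cookies once we submit the CAS form. The login form field names ("username", "password"), the pdf_urls list, and the evals output directory are assumptions; FormRequest.from_response carries over the hidden CAS fields from the login page automatically.

```python
from pathlib import Path

import scrapy

LOGIN_URL = (
    "https://cas.columbia.edu/cas/login?service=ias-qmss"
    "&destination=https%3A%2F%2Fvergil.registrar.columbia.edu%2Fcms%2Fuser%2Fwind"
)


class EvalPdfSpider(scrapy.Spider):
    name = "eval_pdfs"
    start_urls = [LOGIN_URL]
    pdf_urls = []  # links to the evaluation PDFs collected earlier

    def parse(self, response):
        # Fill in the CAS login form; hidden fields come along for free.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "YOUR_UNI", "password": "YOUR_PASSWORD"},
            callback=self.after_login,
        )

    def after_login(self, response):
        for url in self.pdf_urls:
            yield scrapy.Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        out = Path("evals") / response.url.rsplit("/", 1)[-1]
        out.parent.mkdir(exist_ok=True)
        out.write_bytes(response.body)
```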
The PDF file looks like this:
This is very unfriendly if we want to analyze the numbers inside. Luckily, there are tools for this, and Ghostscript is one of the good choices.
```bash
gs -sDEVICE=txtwrite -o output.txt input.pdf
```
This command converts a PDF file into plain text. The result looks like this:
```
Program Evaluation System
```
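To convert every downloaded PDF in one go, a small wrapper around the same command works. The evals directory name matches the download sketch above and is an assumption; gs must be on the PATH.

```python
import subprocess
from pathlib import Path

# Run the Ghostscript text-extraction command over each downloaded PDF.
for pdf in Path("evals").glob("*.pdf"):
    txt = pdf.with_suffix(".txt")
    subprocess.run(
        ["gs", "-sDEVICE=txtwrite", "-o", str(txt), str(pdf)],
        check=True,
    )
```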
There is still a lot of work to do before we can really use these data, but I'll leave that for the future.