We’ll experiment with web scraping and manipulating HTML by getting publication information from Google Scholar. Note that Google Scholar does not have an API, so we are forced to deconstruct the queries that are produced when we point and click on the website.
Your functions may be short enough that it’s ok to put a function directly within a code chunk in your qmd file. Or you might choose to put one or more functions into a .py file and use `inspect.getsource()` to show us the code. For this problem, in which the focus of the work is the function and how it works, it will generally be best to show the function code as part of the problem solution rather than in an appendix.
Go to Google Scholar and enter the name (including first name to help with disambiguation) for a researcher whose work interests you. (If you want to do the one that will match the problem set solutions, you can use “Michael Jordan”, who is a well-known statistician and machine learning researcher here at Berkeley.) If you’ve entered the name of a researcher that Google Scholar recognizes as having a Google Scholar profile, you should see that the first item shown in the results page is a “User profile”. Now, based on the information returned, show the HTML element containing the Google Scholar ID and determine the Google Scholar ID for the researcher. Ideally (see the extra credit part (e)) we would automate that process and write a Python function that returns the ID, but see Question 5 for why that seemingly would violate Google Scholar’s terms of use.
Create a function that constructs the HTTP GET request (and submits that request) to get the citations for the scholar, taking the ID as the input argument and returning the HTML as a Python object.
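A minimal sketch of the request step, assuming the profile page lives at `https://scholar.google.com/citations` and takes the ID in a `user` query parameter (check this against the URL you see in your browser). The function names and the User-Agent value are hypothetical, and the standard library's `urllib` is used only to keep the sketch dependency-free; `requests.get(url, params=...)` would serve equally well.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed endpoint; confirm it against the URL shown in your browser.
BASE_URL = "https://scholar.google.com/citations"


def build_citations_url(scholar_id, lang="en"):
    """Return the URL for a scholar's profile page (assumed query format)."""
    query = urlencode({"user": scholar_id, "hl": lang})
    return f"{BASE_URL}?{query}"


def get_citations_html(scholar_id, timeout=30):
    """Submit the GET request and return the page HTML as a str."""
    request = Request(
        build_citations_url(scholar_id),
        headers={"User-Agent": "Mozilla/5.0"},  # illustrative header value
    )
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Keeping the URL construction in its own function makes it possible to inspect the query string without touching the network.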
IMPORTANT: While running the query in an automated fashion is seemingly allowed (see Problem 5), Google may return “429” errors because it detects automated usage. Here are some things to do in that case:
- You can try to download the HTML file via the UNIX `curl` command, which you can run within Python as `subprocess.run(["curl", "-L", request_string], capture_output=True)`. If necessary for part (c), you can use `curl` or `wget` from the command line to separately download the file and then read that file into Python.
- When developing your code, once you have the code in this part of the problem working to download the HTML, use the downloaded HTML to develop the remainder of your code for part (c) and don’t keep re-downloading the HTML as you work on the remainder of the code.
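The two suggestions above can be combined by caching the download on disk. The sketch below is one possible arrangement, not a required design: `download_with_curl` and `load_or_download` are hypothetical names, and `curl` is assumed to be available on your PATH.

```python
import subprocess
from pathlib import Path


def download_with_curl(url):
    """Fetch a URL with curl (must be on PATH) and return the body as text."""
    result = subprocess.run(["curl", "-L", url], capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def load_or_download(url, cache_path, downloader=download_with_curl):
    """Return cached HTML if cache_path exists; otherwise download and cache it."""
    cache_file = Path(cache_path)
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    html = downloader(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

After the first successful call, later calls read the saved file, so developing the parsing code does not trigger further requests.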
For now, you can assume the user will provide a valid ID and that Google Scholar returns a result for the specified person. We’ll deal with making the code more robust in PS2.
Now write a function that processes the HTML to return a Pandas (or Polars) data frame with the citation information (article title, authors, journal information, year of publication, and number of citations as five columns of information) for the researcher. Try your function on a second researcher to provide more confidence that your function is working properly. (We’ll add unit tests in PS2.) Given the comments in (b), ideally your function will work either if given (i) the name of a file that you’ve already downloaded or (ii) the HTML content produced by your function in (b).
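One way to accept either kind of input is a small helper that decides whether its argument is a file name or the HTML itself. This is only a sketch: `read_html` is a hypothetical name, and the test for "looks like markup" is a heuristic assumption rather than a robust check.

```python
import os


def read_html(source):
    """Return HTML text given either a path to a saved file or the HTML itself."""
    # Heuristic (an assumption of this sketch): markup contains "<",
    # which a file name is very unlikely to contain.
    if "<" not in source and os.path.isfile(source):
        with open(source, encoding="utf-8") as f:
            return f.read()
    return source
```

The parsing function can then call `read_html` first and work with a single string from that point on.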
Hint: a possibly useful argument for `find_all` (and `find`) is `attrs`, which requests element(s) with certain attributes, e.g., `html.find("p", attrs = {'class': 'songtext'})` for finding a `p` element whose class is `songtext`.
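To see the hint in action, here is a small sketch using BeautifulSoup on made-up markup. The HTML string and class names are invented for illustration and do not reflect Google Scholar's actual page structure, which you will need to inspect yourself.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; it is not Google Scholar's page structure.
sample = """
<div>
  <p class="songtext">First verse</p>
  <p class="credits">Written by someone</p>
  <p class="songtext">Second verse</p>
</div>
"""

html = BeautifulSoup(sample, "html.parser")

# find returns the first matching element (or None if there is no match).
first = html.find("p", attrs={"class": "songtext"})

# find_all returns every matching element, in document order.
verses = [
    p.get_text(strip=True)
    for p in html.find_all("p", attrs={"class": "songtext"})
]
```

A list of dictionaries built this way (one per table row, one key per column) can be handed directly to a data frame constructor.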
Create a requirements file (based on either pip or Conda) that has the necessary information (in particular Python package versions) to reproduce the environment in which you ran your code. Include this file in your GitHub repository directory for this problem set.
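The usual route is `pip freeze > requirements.txt` or `conda env export`. If you would rather pin only the packages your code imports, a sketch along the following lines may help; `pinned_requirements` is a hypothetical helper, and note that it expects distribution names (e.g., `beautifulsoup4`, not `bs4`).

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' lines for the packages installed in this environment."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed here, so there is no version to pin.
            continue
    return lines
```

Writing `"\n".join(pinned_requirements([...]))` to `requirements.txt` gives a file that `pip install -r requirements.txt` can consume.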
(Extra credit) If you’d like extra practice, write a Python function that will return the Google Scholar ID when the function is provided an HTML file as its argument. The file would be the file that is returned by searching for a researcher name as the input at scholar.google.com. As discussed in (a), your function should not query Google Scholar using the `requests` package, but rather should manipulate an HTML file that you download after manually querying Google Scholar yourself.
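A possible starting point, using only the standard library, is sketched below. It assumes the profile link in the saved results page looks like `/citations?user=<ID>&...`; verify that against your own downloaded file. `extract_scholar_id` is a hypothetical name, and the regular expression is a simplification that only handles double-quoted `href` attributes.

```python
import html
import re
from urllib.parse import parse_qs, urlparse


def extract_scholar_id(path):
    """Return the first Scholar ID found in a saved search-results file, or None."""
    with open(path, encoding="utf-8") as f:
        page = f.read()
    # Look at every href and keep the first one that points at a
    # citations page and carries a `user` query parameter (assumed link format).
    for href in re.findall(r'href="([^"]+)"', page):
        url = urlparse(html.unescape(href))
        if url.path.endswith("/citations"):
            user = parse_qs(url.query).get("user")
            if user:
                return user[0]
    return None
```

The call to `html.unescape` matters because saved pages typically encode `&` in URLs as `&amp;`, which would otherwise confuse the query-string parsing.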
(Extra credit) If you’d like extra practice, fix your function so that you get all of the results for a researcher and not just the first 20. E.g., for Michael Jordan there are several hundred.
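One way to structure the paging logic is sketched below. It assumes the profile URL accepts query parameters for a starting offset and a page size (watch the requests your browser makes when you click "Show more" to find the actual names, which appear to be `cstart` and `pagesize`). `fetch_page` and `count_rows` are hypothetical callables that you would supply.

```python
def fetch_all_pages(fetch_page, count_rows, page_size=100):
    """Collect result pages until one comes back with fewer than page_size rows.

    fetch_page(start, page_size) returns the HTML for one page;
    count_rows(html) returns how many publications that page lists.
    Both are hypothetical helpers supplied by the caller.
    """
    pages = []
    start = 0
    while True:
        page = fetch_page(start, page_size)
        n_rows = count_rows(page)
        if n_rows > 0:
            pages.append(page)
        if n_rows < page_size:
            break  # a short (or empty) page means we have reached the end
        start += page_size
    return pages
```

Given the 429 errors mentioned in (b), it is sensible to have `fetch_page` pause (e.g., with `time.sleep`) between requests.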