Problem Set 5
Due Friday Oct. 28, 10 am
Problems
1. This problem asks you to use the `future` package to process some Wikipedia traffic data and makes use of tools discussed in Section on October 7. The files in `/scratch/users/paciorek/wikistats/dated_2017_small/dated` (on the SCF) contain data on the number of visits to different Wikipedia pages on November 4, 2008 (the date of the 2008 US election, in which Barack Obama was elected). The columns are: date, time, language, webpage, number of hits, and page size. (Note that the Unit 7 Dask bag example and Question 2 below use a larger set of the same data.)
In an interactive session on the SCF Linux cluster (ideally on the `low` partition, but if necessary on the `high` partition), started using `srun` as discussed in Section:

i. Copy the data files to a subdirectory of the `/tmp` directory of the machine your interactive session is running on. (Keep your code files in your home directory.) Putting the files on the local hard drive of the machine you are computing on reduces the amount of copying data across the network (in the situation where you read the data into your program multiple times) and should speed things up in step ii.
ii. Write efficient R code to do the following (a minimal R sketch appears after the hints below):
    - Using the `future` package, with either `future_lapply` or `foreach` with the `doFuture` backend, write code that, in parallel, reads in the space-delimited files and filters to only the rows that refer to pages where “Barack_Obama” appears in the page title (column 4). You can use the code from Unit 6 as a template. Collect all the results into a single data frame. In your `srun` invocation and in your code, please use 4 cores in your parallelization so that other cores are saved for use by other users/students. IMPORTANT: before running the code on the full set of data, please test your code on a small subset first (and test your function on a single input file serially).
    - Tabulate the number of hits for each hour of the day. (I don’t care how you do this - you could use dplyr or base R functions or something else.) Make a (time-series) plot showing how the number of visits varied over the day. Note that the time zone is UTC/GMT, so you won’t actually see the evening times when Obama’s victory was announced - we’ll see that in Question 2.
iii. Remove the files from `/tmp`.
Hints: (a) `readr::read_delim()` should be quite fast if you give it information about the structure of the files, (b) there are lines with fewer than 6 fields, but `read_delim()` should still work and simply issue a warning, and (c) there are lines that have quotes that should be treated as part of the text of the fields and not as separators.
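To make step ii concrete, here is a minimal R sketch of the `future_lapply` approach. The directory name, the `fields` vector, the use of `quote = ""`, and the assumption that the time field is an HHMMSS-style string are illustrative guesses rather than requirements, so adapt them to what you actually see in the files, and test the function on a single file serially before running anything in parallel.

```r
## Minimal sketch of step ii (illustrative only; adjust paths, column names,
## and the number of workers to your own setup).
library(future.apply)
library(readr)
library(dplyr)

plan(multisession, workers = 4)   # match the 4 cores requested via srun

data_dir <- "/tmp/wikistats"      # hypothetical subdirectory from step i
files <- list.files(data_dir, full.names = TRUE)

fields <- c("date", "time", "lang", "page", "hits", "size")

read_and_filter <- function(file) {
  ## Supplying col_types speeds up parsing (hint (a)); quote = "" treats quote
  ## characters as ordinary text rather than as field delimiters (hint (c));
  ## lines with fewer than 6 fields just trigger a warning (hint (b)).
  d <- read_delim(file, delim = " ", col_names = fields,
                  col_types = "ccccdd", quote = "")
  filter(d, grepl("Barack_Obama", page, fixed = TRUE))
}

obama <- bind_rows(future_lapply(files, read_and_filter))

## One way to tabulate hits by hour, assuming the time field is an
## HHMMSS-style string; adjust if the format differs.
hourly <- obama |>
  mutate(hour = substr(time, 1, 2)) |>
  group_by(hour) |>
  summarize(total_hits = sum(hits))
```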
Now replicate steps i and ii but using `sbatch` to submit your job as a batch job to the SCF Linux cluster, where step ii involves running R from the command line using `R CMD BATCH`. You don’t need to make the plot again. Note that you need to copy the files to `/tmp` in your submission script, so that the files are copied to `/tmp` on whichever node of the SCF cluster your job gets run on. Make sure that as part of your `sbatch` script you remove the files in `/tmp` at the end of the script. (Why? In general `/tmp` is cleaned out when a machine is rebooted, but this might take a while to happen, and many of you will be copying files to the same hard drive.)
2. Consider the full Wikipedia traffic data for October-December 2008 (already available in `/var/local/s243/wikistats/dated_2017` on any of the SCF cluster nodes in the low or high partitions).
    - Explore the variation over time in the number of visits to Barack Obama-related Wikipedia sites, based on searching for “Barack_Obama” on English-language Wikipedia pages. You should use Dask to do the reading and filtering. Then group by day-hour (it’s fine to do the grouping/counting in Python in a way that doesn’t use Dask data structures). You can do this either in an interactive session using `srun` or in a batch job using `sbatch`. And if you use `srun`, you can run Python itself either interactively or as a background job. Time how long it takes to read the data and do the filtering, to get a sense for how much time is involved in working with this much data. Once you have done the filtering and gotten the counts for each day-hour, you can simply use standard R or Python code on your laptop to do some plotting to show how the traffic varied over the days of the full October-December period and particularly over the hours of November 3-5, 2008 (election day was November 4, and Obama’s victory was declared at 11 pm Eastern time on November 4). A small plotting sketch appears after the notes below.
    - Extra credit: Carry out some other in-depth analysis of the Wikipedia data (it doesn’t have to involve Barack Obama), addressing a question of interest to you.
Notes:
- I’m not expecting you to know any more Python than we covered in the Unit 6/7 material on Dask and in Section, so feel free to ask for help with Python syntax on the discussion forum or in office hours (and those of you who know Python, please help out).
- There are various ways to do this using Dask bags or Dask data frames, but I think the easiest, in terms of using code that you’ve seen in Unit 7, is to read the data in and do the filtering using a Dask bag and then convert the Dask bag to a Dask data frame to do the grouping and summarization. Alternatively, you should be able to use `foldby()` from `dask.bag`, but figuring out what arguments to pass to `foldby()` is a bit involved.
- Make sure to test your code on a portion of the data before doing computation on the full dataset. Reading and filtering the whole dataset will take something like 60 minutes with 16 cores. You MUST test on a small number of files on your laptop or on one of the stand-alone SCF machines (e.g., radagast, gandalf, arwen) before trying to run the code on the full 120 GB (zipped) of data. For testing, the files are also available in `/scratch/users/paciorek/wikistats/dated_2017`.
- When doing the full computation via your Slurm job submission:
- Don’t copy the data (unlike in Question 1) to avoid overloading our disks with each student having their own copy.
- Please do not use more than 16 cores in your Slurm job submissions so that cores are available for your classmates. If your job is stuck in the queue you may want to run it with 8 rather than 16 cores.
- As discussed in Section, when you use `sbatch` to submit a job to the SCF cluster or `srun` to run interactively, you should be using the `--cpus-per-task` flag to specify the number of cores that your computation will use. In your Python code, you can then either hard-code that same number of cores as the number of workers or (better) use the `SLURM_CPUS_PER_TASK` shell environment variable to tell Dask how many workers to start.
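As mentioned above, once you have the day-hour counts on your laptop, the plotting might look roughly like the following R sketch. The file name and the `day_hour`/`hits` column layout are hypothetical; adapt them to however you saved the results of the Dask step.

```r
## Laptop-side plotting sketch (illustrative only). Assumes the counts were
## saved as a CSV with a 'day_hour' column like "2008-11-04-23" (UTC) and a
## 'hits' column; both the file name and the format are hypothetical.
library(readr)
library(dplyr)
library(ggplot2)

counts <- read_csv("obama_counts.csv", col_types = "cd") |>
  mutate(time = as.POSIXct(day_hour, format = "%Y-%m-%d-%H", tz = "UTC")) |>
  arrange(time)

ggplot(counts, aes(x = time, y = hits)) +
  geom_line() +
  labs(x = "time (UTC)", y = "hits on Barack_Obama-related pages")
```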
3. Using the Stack Overflow database, write SQL code that will determine which users have asked html-related questions but not css-related questions. Those of you with more experience with SQL might do this in a single query, but it’s perfectly fine to create one or more views and then use those views in a subsequent query to get the result. Report how many unique such users there are. There are various ways to do this, of which I only covered some approaches in Unit 7 and in the videos. You can run your query via either R or Python (a sketch of the view-then-query mechanics from R appears below).
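If you run your query from R, the mechanics might look roughly like the sketch below. The database path and the table and column names (`questions`, `questions_tags`, `questionid`, `ownerid`, `tag`) are assumptions, so check the actual schema with `dbListTables()` and `dbListFields()`. The sketch only sets up the html side, leaving the css-exclusion logic (and hence the answer) to you.

```r
## Sketch of the view-then-query pattern from R (path and schema names are
## assumptions; verify them against the actual database).
library(DBI)
library(RSQLite)

db <- dbConnect(SQLite(), "/path/to/stackoverflow.db")  # hypothetical path
dbListTables(db)
dbListFields(db, "questions")

## Create a (temporary) view of users who asked html-related questions;
## a TEMP view avoids writing to a shared, possibly read-only, database file.
dbExecute(db, "
  CREATE TEMP VIEW html_askers AS
  SELECT DISTINCT Q.ownerid
  FROM questions Q
  JOIN questions_tags T ON Q.questionid = T.questionid
  WHERE T.tag LIKE '%html%'")

## ... then use the view in a follow-up query.
dbGetQuery(db, "SELECT COUNT(*) AS n FROM html_askers")

dbDisconnect(db)
```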
4. This question prepares for the discussion of a simulation study in Section on Friday, October 28, and asks you to think carefully about the design and interpretation of simulation studies, which we’ll talk about in Unit 9. In particular, we’ll work with Cao et al. (2015), an article in the Journal of the Royal Statistical Society, Series B, a leading statistics journal. The article is available as `cao_etal_2015.pdf` under the `ps` directory on GitHub. Read Sections 1, 2.1, and 4 of the article. Also read Section 2 of Unit 9. Briefly (a few sentences for each of the three questions below) answer the following questions.
You don’t need to understand their method for fitting the regression [i.e., you can treat it as some black-box algorithm] or the theoretical development. In particular, you don’t need to know what an estimating equation is - you can think of it as an alternative to maximum likelihood or to least squares for estimating the parameters of the statistical model. Equation 3 on page 759 is analogous to taking the sum of squares for a regression model, differentiating with respect to the regression coefficients, and setting the result equal to zero.
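For intuition only (this is generic least-squares notation, not the paper’s): differentiating the sum of squares $\sum_i (y_i - x_i^\top \beta)^2$ with respect to $\beta$ and setting the derivative to zero gives the estimating equation

$$\sum_{i=1}^{n} x_i \, (y_i - x_i^\top \beta) = 0,$$

which one solves for $\hat\beta$ without ever referring back to an objective function; Equation 3 in the paper plays that same role for their model.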
- What are the goals of their simulation study, and what are the metrics that they consider in assessing their method?
- What choices did the authors have to make in designing their simulation study? What are the key aspects of the data generating mechanism that might affect their assessment of their method?
- Consider their Tables reporting the simulation results. For a method to be a good method, what would one want to see numerically in these columns?
Comments
For code that needs to run on the SCF cluster (or that takes a long time to run), put it in chunks with the chunk option `eval=FALSE`. You can paste in any output you need to demonstrate your work. Remember that you can use backtick code fences to delineate blocks of text you want printed verbatim. You may also want to put your answer to the SQL question in an unevaluated chunk (or possibly use `cache=TRUE` as a chunk option), depending on how long the query takes.