Syllabus

Course description

Statistics 243 is an introduction to statistical computing, taught using Python, with ‘statistical’ defined broadly to include data science and machine learning. The course will cover both programming concepts and statistical computing concepts. Programming concepts will include data and text manipulation, regular expressions, data structures, functions and variable scope, memory use, efficiency, debugging, testing, and parallel processing. Statistical computing topics will include working with large datasets, numerical linear algebra, computer arithmetic/precision, simulation studies and Monte Carlo methods, and numerical optimization. A goal is that coverage of these topics complement the models/methods discussed in the rest of the statistics/biostatistics graduate curriculum. We will also cover the basics of UNIX/Linux, in particular shell scripting and operating on remote servers, as well as a bit of R.

What the course is not

While the course is taught using Python and you will learn a lot about using Python at an advanced level, this is not a course about learning Python. Rather the focus of the course is computing for statistics and data science more generally, using Python to illustrate the concepts.
This is not a course that will cover specific statistical/machine learning/data analysis methods.

Prerequisites

Informal prerequisites: If you are not a statistics or biostatistics graduate student, please chat with me if you’re not sure if this course makes sense for you. A background in calculus, linear algebra, probability and statistics is expected, as well as a basic ability to operate on a computer (but I do not assume familiarity with the UNIX-style command line/terminal/shell). Furthermore, I’m expecting you will know the basics of Python, at the level of the Python material in our computing skills workshop offered Aug. 18-19, 2025. If you don’t have that background you’ll need to spend time in the initial couple weeks getting up to speed. The workshop materials are a good resource.

In general, the material in the workshop is a good reference for the course. Many of you will have attended, but attending the workshop is not a requirement for taking the course. If you didn’t attend, it’s worthwhile to look through the workshop materials, both in general and for additional information on various topics we’ll see in Units 1, 3, 4, and 5.

Objectives of the course

The goals of the course are that, by the end of the course, students be able to:

operate effectively in a UNIX environment and on remote servers and compute clusters;
have a solid understanding of general programming concepts and principles, and be able to program effectively (including having an advanced knowledge of Python functionality);
be familiar with concepts and tools for reproducible research and good scientific computing practices; and
understand in depth and be able to make use of principles of numerical precision, numerical linear algebra, optimization, and simulation for statistics- and data science-related analyses and research.

Topics (in order with rough timing)

The ‘days’ here are (roughly) class sessions, as general guidance.

Introduction to UNIX, operating on a compute server (1 day)
Data formats, data access, webscraping, data structures (2 days)
Debugging, good programming practices, reproducible research (1 day)
The bash shell and shell scripting, version control (3 days)
Programming concepts and advanced Python programming: text processing and regular expressions, object-oriented programming, functions and variable scope, memory use, efficient programming (9 days)
Parallel processing (2 days)
Working with databases, hashing, and big data (3 days)
Computer arithmetic/representation of numbers on a computer (3 days)
Simulation studies and Monte Carlo (2 days)
Numerical linear algebra (5 days)
Optimization (5 days)
Graphics (1 day)

Personnel

Instructor:
- Chris Paciorek (paciorek@stat.berkeley.edu)
GSI:
- João Vitor Romano (jv.romano@berkeley.edu)
Office hours can be found here.

Course websites: GitHub, Ed Discussion, Gradescope, and bCourses

Key websites for the course are:

This course website, which is hosted on GitHub pages, and the GitHub repository containing the source materials: https://github.com/berkeley-stat243/fall-2025
SCF tutorials for additional content: https://statistics.berkeley.edu/computing/training/tutorials
Ed Discussion site for discussions/Q&A: https://edstem.org/us/courses/81308/discussion
bCourses site for course capture recordings (see Media Gallery) and possibly some other materials: https://bcourses.berkeley.edu/courses/1546970
Gradescope for assignments (also linked from bCourses): https://www.gradescope.com/courses/1076535

All course materials will be posted here on the website (and on GitHub) except for video content, which will be in bCourses.

Course discussion

We will use the course Ed Discussion site for communication (announcements, questions, and discussion). You should ask questions about class material and problem sets through the site. Please use this site for your questions so that either João or I can respond and so that everyone can benefit from the discussion. I strongly encourage you to respond to or comment on each other’s questions as well (this will help your class participation grade), although of course you should not provide a solution to a problem set problem. If you have a specific administrative question you need to direct just to me, it’s fine to email me directly or post privately on the Discussion site. But if you simply want to privately ask a question about content, then just come to an office hour or see me after class or João during/after section.

Ed Discussion settings

You are responsible for keeping track of all course announcements and discussions, which will take place on the Ed Discussion forum.

I suggest you modify your settings on Ed Discussion so you are informed by email of postings.

If you’re enrolled in the class you should be a member of the Ed Discussion group and be able to access it. If you’re auditing or not yet enrolled and would like access, make sure to fill out the course survey and I will add you.

Course material

Primary materials: Course notes on this course webpage/GitHub and SCF tutorials.
Back-up textbooks (generally available via UC Library via links below):
- For Python, bash/shell, Git and computing/software skills: Damien Irving, Kate Hertweck, Luke Johnston, Joel Ostblom, Charlotte Wickham, and Greg Wilson. Research Software Engineering with Python
- For bash: Newham, Cameron and Rosenblatt, Bill. Learning the bash Shell
- For Quarto: The Quarto reference guide
- For statistical computing topics:
  - Gentle, James. Computational Statistics
  - Gentle, James. Matrix Algebra or Numerical Linear Algebra for Applications in Statistics
- Other resources with more detail on particular aspects of statistical computing concepts:
  - Lange, Kenneth. Numerical Analysis for Statisticians, 2nd ed. (first edition available through UC Library)
  - Monahan, John. Numerical Methods of Statistics

Section

The GSI will lead a two-hour discussion section each week (there are two sections). These will cover various topics as well as provide time for the problem set quizzes (more details on those below). In some weeks (particularly those without quizzes), these may only last for about one hour of actual content, but the second hour may be used as an office hour with the GSI or for troubleshooting software during the early weeks.

The discussion sections will vary in format and topic, but material will include demonstrations on various topics (version control, debugging, testing, etc.), group work on these topics, discussion of relevant papers, and discussion of problem set solutions.

Attend the section you are assigned to

The first section (12-2 pm) generally has more demand, so to avoid having too many people in the room, you should go to your assigned section unless you talk to me first (e.g., if you have a one-time conflict).

Computing Resources

Most work for the course can be done on your laptop. Later in the course we’ll also use the Statistics Department Linux cluster. You can also use the SCF JupyterHub or the campus DataHub to access a bash shell or run an IPython notebook.

DataHub limitations

The campus DataHub is limited in terms of number of CPU cores and memory so won’t be suitable for more computationally-intensive work later in the semester, including problem sets and labs. But you’re welcome to use it otherwise.

The software needed for the course is as follows:

Access to the UNIX command line (bash shell)
Git
Python (the Miniforge installation of Conda is recommended but by no means required)
Quarto
VS Code – this is not required, but we’ll be using it to a greater or lesser extent (still to-be-determined)

See the “how tos” in the left sidebar for tips on software installation and access to a UNIX shell, which you’ll need to be able to do by the second week of class.

Class time

My goal is to have classes be an interactive environment. This is both more interesting for all of us and more effective in learning the material. I encourage you to ask questions and will pose questions to the class to think about, respond to via online polling or Google forms, and discuss. To increase time for discussion and assimilation of the material in class, before some classes I may ask that you read material or work through tutorials in advance of class. Occasionally, I will ask you to submit answers to questions in advance of class as well.

Phone/Laptop use in class

Please do not use phones during class (except if using to respond to surveys/in-class questions) and limit laptop use to the material being covered.

Student backgrounds with computing will vary. For those of you with limited background on a topic, I encourage you to ask questions during class so I know what you find confusing. For those of you with extensive background on a topic (there will invariably be some topics where one of you will know more about it than I do or have more real-world experience), I encourage you to pitch in with your perspective. In general, there are many ways to do things on a computer, particularly in a UNIX environment and in Python, so it will help everyone (including me) if we hear multiple perspectives/ideas.

Finally, class recordings for review or to make up for absence will be available in the Media Gallery tab on the bCourses page for the class.

Course requirements and grading

Scheduling Conflicts

Campus asks that I include this information about conflicts: Please notify me in writing by the second week of the term about any known or potential extracurricular conflicts (such as religious observances, graduate or medical school interviews, or team activities). I will try my best to help you with making accommodations, but I cannot promise them in all cases. In the event there is no mutually-workable solution, you may be dropped from the class.

The main conflicts that would be a problem are those with the two mini-exams, whose dates I will determine in late August / early September.

Mini-exams and problem set quizzes are in-person. There is no remote option, and the only make-up accommodations I will make are for illness or serious personal issues. Do not schedule any travel that may conflict with a mini-exam.

Course grades

The grade for this course is primarily based on assignments due every 1-2 weeks, in-section quizzes on the assignments, two mini-exams (likely in early-mid October and mid-late November), and a final group project. I may also provide extra credit questions on some problem sets. There is no final exam.

45% of the grade is based on the problem sets, in these two components:
- 15% problem set completion
- 30% in-section problem-set quizzes
30% on the mini-exams,
15% on the project, and
10% on your class participation:
- your responses to the in-class (and occasionally in-advance-of-class) Google forms questions,
- completion of occasional non-problem set assignments (including assignments in lab section), and
- substantive contribution to discussions on Ed (i.e., responding to your classmates’ questions and asking thoughtful questions about course material).
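As a quick check of the arithmetic, the weights above sum to 100%. The sketch below is illustrative only: it assumes every component is scored on a 0-100 scale, and the function name and the example scores are hypothetical, not the official grade calculation.

```python
# Weights taken from the breakdown above (as fractions of the course grade).
WEIGHTS = {
    "problem_set_completion": 0.15,
    "problem_set_quizzes": 0.30,
    "mini_exams": 0.30,
    "project": 0.15,
    "participation": 0.10,
}


def weighted_course_score(scores):
    """Combine component scores (each assumed to be on a 0-100 scale).

    Hypothetical helper for illustration; not the official calculation.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Made-up example scores, not real data.
example_scores = {
    "problem_set_completion": 100,
    "problem_set_quizzes": 90,
    "mini_exams": 80,
    "project": 90,
    "participation": 100,
}

total_weight = sum(WEIGHTS.values())
example_total = weighted_course_score(example_scores)
print(f"weights sum to {total_weight:.2f}; example score: {example_total:.1f}")
```

Note that the two problem set components together make up the 45% stated above.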

Grades will generally be As and Bs. An A involves doing all the work, getting full credit on most of the problem sets and quizzes, doing well on the mini-exams, and doing a thorough job on the final project.

Problem sets

Potential changes to problem set handling

This is the first time I am using the approach to problem sets given below, which is a response to the impacts of AI. I may make modifications as the course proceeds. If so, I will be careful to let you know.

AI and problem sets

We are of course in the midst of huge changes from the rapid advances in AI, in particular generative AI, with large language model-enabled tools such as ChatBots and AI-assisted coding. The following approach to problem sets is intended to recognize that these tools are widely-used and can be very useful (both for day-to-day work and for learning), while still trying to make sure that we understand core computing and programming concepts and that we can build upon, critique, and debug what AI agents produce. A key challenge for you (and for me as an instructor) is thinking about how to use AI to develop the skills and expertise you will need without fooling yourself into thinking you understand more than you do. We are all still trying to figure this out!

I am not going to police your use of AI. You are welcome to use ChatBots and AI-assisted coding tools. You are also welcome to collaborate with others in the class. However, if you simply turn in answers that are mainly copies of what an AI tool (or your classmate) told you, you are probably not learning the material, and this is likely to be reflected in your performance on the problem set quizzes (and on the mini-exams, albeit less directly).

Think of the problem sets in combination with the quizzes (see below) as a train-test split. You have freedom in working with the training set, but some strategies could result in poor performance on the test set. In data analysis, this involves overfitting. In our class context, this involves relying too heavily on AI or your classmates and not doing the work to really understand the material.
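To make the train-test analogy concrete, here is a minimal sketch of overfitting using only the Python standard library. The simulated data, the seed, and the "memorizer" model (a one-nearest-neighbor lookup) are assumptions chosen for illustration; they are not taken from the course materials.

```python
import random

random.seed(243)  # arbitrary seed so the sketch is reproducible


def noisy_line(x):
    """Simulated response: a linear trend plus Gaussian noise (made-up data)."""
    return 2.0 * x + random.gauss(0.0, 1.0)


# Evenly spaced training inputs and interleaved test inputs.
x_train = [i / 20 for i in range(20)]
x_test = [(i + 0.5) / 20 for i in range(20)]
y_train = [noisy_line(x) for x in x_train]
y_test = [noisy_line(x) for x in x_test]


def memorizer(x):
    """Predict by copying the response of the nearest training input."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]


def mse(xs, ys):
    """Mean squared error of the memorizer on the given data."""
    return sum((memorizer(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_error = mse(x_train, y_train)  # zero: every training point is memorized
test_error = mse(x_test, y_test)     # larger: the memorized noise does not carry over

print(f"train MSE: {train_error:.3f}, test MSE: {test_error:.3f}")
```

The memorizer looks perfect on the data it has already seen and does worse on new data, which is the risk of copying solutions without understanding them.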

How much to make use of AI is for you to consider carefully. My current suggestion (which is also how I suggest working with classmates) is that you initially try seriously to figure out a given problem on your own (without substantial input from AI or others). By substantial, I mean getting input on an entire solution or blocks of a solution. If you want to ask an AI for input on small portions of a problem (such as code syntax) that you then check and understand the result of, that seems to me to be a good use of the tools. After that, if you’re stuck or want to explore alternative approaches or check what you’ve done, then using AI more fully, consulting with your fellow students, and talking with the GSI and with me are all recommended. I.e., these tools help us brainstorm and do our work more efficiently; they don’t replace us thinking (at least not yet…). You should be sure you understand (and could explain to someone else) in detail what the code does and how/why it works. On that note, the problem set quizzes are intended to assess that.

Problem set quizzes

For each problem set, there will be a short quiz on paper in section very soon after the problem set is due (possibly the same day). The quiz will check your understanding of the material and will involve questions directly related to a subset of the problems. If you understood the material covered in the problems, you should do well on the quizzes without “studying” for the quiz.

At the moment I anticipate that the quizzes will be “closed book” – you won’t be able to refer to any materials during the quizzes. However, I may make some modifications to this as we go along. (This is the first time I am trying the approach of problem set quizzes.)

The quizzes are intended to be fairly short and without time pressure. For practical reasons the time limit will be determined by the length of section, but my hope is that this is long enough to be equivalent to having no time limit.

Quiz retakes

If you do poorly on a quiz and would like to retake it, there will be an oral retake procedure. This will involve talking with me (probably during office hours) shortly after the quiz is graded. I will then have a conversation with you about the quiz questions (or similar questions) in order to assess your understanding.

If you need to miss a quiz because of illness, you can make use of the retake procedure as well.

Submitting assignments

We will be less willing to help you if you come to our office hours or post a question online at the last minute. Working with computers can be unpredictable, so give yourself plenty of time for the assignments.

In the first section (September 5), we’ll discuss how to submit your problem sets both on Gradescope and via your class GitHub repository, located at https://github.berkeley.edu/<your_calnet_username>.

There are several rules for submitting your assignments.

You should prepare your assignments using Quarto.

Quarto Markdown is an extension to the Markdown markup language that allows one to embed Python and R code within an HTML document. Please see the SCF dynamic documents tutorial; there will be additional information in the first section and on the first problem set.
Problem set submission consists of both of the following:
1. A PDF submitted electronically through Gradescope, by the start of class (10 am) on the due date, and
2. An electronic copy of the PDF, code files, and Quarto document pushed to your class GitHub repository, following the instructions to be provided by the GSI.
On-time submission will be determined based on the time stamp of when the PDF is submitted to Gradescope.
Answers should consist of textual responses or mathematical expressions as appropriate, with key chunks of code embedded within the document. Function definitions would generally be placed in a separate .py code file. The function definitions and any extensive additional code should be provided as an appendix. Before diving into the code for a problem, you should say what the goal of the code is and your strategy for solving the problem. Raw code without explanation is not an appropriate solution. Please see our qualitative grading rubric for guidance. In general the rubric is meant to reinforce good coding practices and high-quality scientific communication.
Any mathematical derivations may be done by hand and scanned with your phone if you prefer that to writing up LaTeX equations (but I recommend you become familiar with LaTeX equations if you aren’t already). You might also consider trying typst (which I have been meaning to do myself).
You must provide attribution for ideas obtained elsewhere, including other students, ChatBots, and AI coding assistants. If you got a specific idea for how to do part of a problem from AI or a fellow student, you should note that in your solution in the appropriate place (for specific syntax ideas, note this in a code comment), just as you would cite a book or URL (but you don’t need a formal citation). The reason for this is two-fold: first to reinforce standard scientific citation practices and second so that you are clear with yourself how much and in what way you are relying on AI (or other people).
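As one possible illustration of the rules above, a function in a separate .py file might state its goal and strategy in a docstring and note attribution in a code comment. The function name, the task, and the attribution text below are all hypothetical; the grading rubric, not this sketch, defines what is expected.

```python
import re


def count_words(text):
    """Count the words in a string.

    Goal: return the number of tokens that contain at least one word
    character (letter, digit, or underscore).
    Strategy: split on whitespace, then keep tokens that match a regex.
    """
    # Attribution (illustrative): the idea of filtering tokens with
    # re.search(r"\w", ...) came from an AI coding assistant; I checked it
    # on the small example below.
    tokens = text.split()
    return sum(1 for token in tokens if re.search(r"\w", token))


# The lone "-" is not counted as a word.
print(count_words("Monte Carlo methods - a brief overview"))
```

The attribution is informal, just as the rule describes: it says where the idea came from and how it was checked, without a formal citation.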

Problem set grading

The grading scheme for problem sets is as follows. Each problem set will receive a numeric score for each of (1) presentation and explanation of results, (2) technical accuracy of code or mathematical derivation, and (3) code quality (style, structure, reproducibility, and creativity). For each of these three components, the possible scores are:

0 = no credit,
1 = partial credit,
2 = full credit.

Again, the qualitative grading rubric provides guidance on what we want to see for full credit.

For components #1 and #3, some of you will probably get a score of 1 for some problem sets as you develop good presentation and coding practices (hopefully only for the initial problem sets).

This grading scheme is intended to essentially be a complete-incomplete scheme, but with the ability for us to give you feedback on your presentation of results and code. We will not in general be commenting on the technical quality of your responses - for that please compare your work to the solutions I will distribute. Technical understanding (and your ability to explain that clearly) will be assessed via the quizzes.

For late problem sets, I will take off some points, increasingly so the later it is, up until the late deadline listed in Gradescope, after which it won’t be accepted or graded.

The grading scheme for the quizzes will be point-based.

Final project

The final project will be a joint coding project in groups of 3-4. I’ll assign an overall task, and you’ll be responsible for dividing up the work, coding, debugging, testing, and documentation. You’ll need to use the Git version control system for working in your group.

Feedback

I welcome comments, suggestions, and concerns. Particularly good suggestions will count towards your class participation grade.

As we navigate the impact of AI together, I particularly welcome your thoughts on how to incorporate AI tools in my presentations, problem sets, and labs and in your classwork.

Accommodations for Students with Disabilities

Please see me as soon as possible if you need particular accommodations, and we will work out the necessary arrangements.

Campus Honor Code

The following is the Campus Honor Code. With regard to AI, collaboration and independence, please see my comments regarding problem sets above – Chris.

The student community at UC Berkeley has adopted the following Honor Code: “As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others.” The hope and expectation is that you will adhere to this code.

Collaboration and Independence: Reviewing lecture and reading materials and studying for exams can be enjoyable and enriching things to do with fellow students. This is recommended. However, unless otherwise instructed, homework assignments are to be completed independently and materials submitted as homework should be the result of one’s own independent work.

Cheating: A good lifetime strategy is always to act in such a way that no one would ever imagine that you would even consider cheating. Anyone caught cheating on a quiz or exam in this course will receive a failing grade in the course and will also be reported to the University Center for Student Conduct. In order to guarantee that you are not suspected of cheating, please keep your eyes on your own materials and do not converse with others during the quizzes and exams.

Plagiarism: To copy text or ideas from another source without appropriate reference is plagiarism and will result in a failing grade for your assignment and usually further disciplinary action. For additional information on plagiarism and how to avoid it, see, for example: http://gsi.berkeley.edu/teachingguide/misconduct/prevent-plag.html

Academic Integrity and Ethics: Cheating on exams and plagiarism are two common examples of dishonest, unethical behavior. Honesty and integrity are of great importance in all facets of life. They help to build a sense of self-confidence, and are key to building trust within relationships, whether personal or professional. There is no tolerance for dishonesty in the academic world, for it undermines what we are dedicated to doing – furthering knowledge for the benefit of humanity.

Your experience as a student at UC Berkeley is hopefully fueled by passion for learning and replete with fulfilling activities. And we also appreciate that being a student may be stressful. There may be times when there is temptation to engage in some kind of cheating in order to improve a grade or otherwise advance your career. This could be as blatant as having someone else sit for you in an exam, or submitting a written assignment that has been copied from another source. And it could be as subtle as glancing at a fellow student’s exam when you are unsure of an answer to a question and are looking for some confirmation. One might do any of these things and potentially not get caught. However, if you cheat, no matter how much you may have learned in this class, you have failed to learn perhaps the most important lesson of all.