Wednesday, 6 December 2017

Reflection on teaching a 400-level course on Big Data for statisticians.

This was the first course I have taught that I would consider a 'main track' course, in that the students were there to learn more about something they were already competent in at the start. Most of the courses I have previously taught were 'service courses', designed and delivered by the Statistics Department in service to other departments that wanted their own students to have a stronger quantitative and experimental background (e.g. Stat 201, 203, 302, and 305). The exception, Stat 342, was designed for statistics majors, but it is built as an introduction to SAS programming. Since most other courses in the program are taught using R or Python, teaching SAS feels like teaching a service course as well, in that I am teaching something away from the students' main competency, and enrollment is driven mainly by requirement rather than interest.

In my usual courses, I am frequently grilled by anxious students about what exactly is going to be on the exams. A frequent complaint in student responses is that I spend too much time on 'for interest' material that is not explicitly tested on the exams. I've also found that I needed to adhere to a rigid structure in course planning and in grading policy. Moving an assignment's due date, teaching something out of the stated syllabus order, changing the scope or schedule of a midterm, or even dropping an assignment and moving the grade weight have all caused a cascade of problems in previous classes.

Stat 440, Learning from Big Data, was a major shift.

I don't know which I prefer, and I don't know which is easier in the long term, but it is absolutely a different skill set. The bulk of the effort shifted from managing people to managing content. I did not struggle to keep the classroom full, but I did struggle to meaningfully fill the classroom's time. I had planned to cover the principles of modern methods (cross validation, model selection, LASSO, dimension reduction, regression trees, neural nets), some data cleaning (missing data, imputation, image processing), some technology (SQL, parallelization, Hadoop), and some text analysis (regular expressions, edit distance, XML processing), but I still had a couple of weeks to fill at the end because of the lightning speed at which I was able to burn through these topics without protest.
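To give a flavour of the text-analysis material, edit distance is one of those topics that fits in a single short lecture example. This is a generic dynamic-programming sketch in Python (one of the languages used in the program), written here for illustration rather than taken from the actual course notes:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b via dynamic programming.

    prev[j] holds the distance between the first i-1 characters of a
    and the first j characters of b; curr builds the next row.
    """
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from prefix of a to empty prefix of b
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,              # deletion from a
                curr[j - 1] + 1,          # insertion into a
                prev[j - 1] + (ca != cb)  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # classic example: 3 edits
```

The same row-by-row structure generalizes to alignment problems, which is part of why it makes a good teaching example.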


In 'big data', the limits of my own knowledge became a major factor. Most of what I covered in class wouldn't have been considered undergrad material ten years ago when I was a senior (imputation, LASSO, neural nets); some of it didn't exist (Hadoop). There are plenty of textbook and online resources for learning regression or ANOVA, but the information for many of the topics of this course was cobbled together from blog posts, technical reports, and research papers. A lot of resources were either extremely narrow in scope or vague to the point of uselessness. I needed materials that were high-level enough for someone who is not already a specialist to understand, yet technical enough that someone well-versed in data science would get something of value from them, and I didn't find enough.

The flip side of this was that motivation was easy. Two of the three case studies assigned had an active competition component. The first was a US-based challenge to use police data from three US cities; presentation was a major basis on which the police departments would judge the results. As such, I had requests for help with plotting and geographic methods that were completely new to me. A similar thing happened with the 'iceberg' case study, based on this Kaggle competition. I taught the basics of neural nets, and at least three groups asked me about generalizations and modifications to neural nets that I didn't know about. (The other case study was a 'warm-up' in which I adapted material from a case study competition held by the Statistical Society of Canada; the students were not in active competition.) At least 20% of the class has more statistical talent than I do.

To adapt to this challenge and advantage, I switched about mid-semester from my usual delivery method of a PDF slideshow to commentary while running through computer code. This worked well for material that was directly useful for the three case study projects, such as all the image processing work I showed for the case study on telling icebergs from ships. It wasn't as good for material that would be studied for the final exam: I went through some sample programs on web scraping, the feedback wasn't as positive, and the answers I got on the final exam's web scraping question were too specific to the examples I had given.
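The sort of sample program I mean is small and concrete, which may be exactly why exam answers mirrored it too closely. This is a generic illustration of the scraping idea using only Python's standard-library HTML parser, not one of the actual in-class programs:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag encountered in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A hard-coded page stands in for a real download (e.g. via urllib).
page = '<html><body><a href="a.html">A</a> and <a href="b.html">B</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # the two hrefs, in document order
```

The point of an example like this is the event-driven parsing pattern, which transfers to pages other than the one shown; the risk in teaching from it is that students memorize the specific tags instead.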

A side challenge was the ethical dilemma of limiting my advice to students looking to improve their projects. I had to avoid using insights that other students had shared with me because of the competitive nature of the class. Normally if someone had difficulty with a homework problem, I could use their learning and share it with others, but this time, that wasn't automatically the case.

There was also a substantial size difference: I had 20-30 students, by far the smallest class I've ever lectured to. Previously, Stat 342 was my 'small' class, with enrollment between 50 and 80, small only in comparison to service classes of 100-300 students. The size allowed me to actually communicate with students on a much more one-on-one level. Furthermore, since most of the work was done in small team settings, I got to know what each group of students was working on for their projects.

I worry that what I delivered wasn't exactly big data, and was really more of a mixed bag of data science. However, there was a lot of feedback from students that they found it valuable, and added value was the goal all along.