Saturday, 29 April 2017

A Long-term vision for a Master's Degree in Data Science Program

I was recently asked to write a statement of vision for a potential one-year coursework master's program. A colleague was kind enough to look at it for me; he said it was 'a bit too bombastic'.

This statement has already been submitted to its intended destination.


Successful data scientists need to be able to communicate not just verbally and in writing, but also
visually by way of graphs, dashboards, and animations. They need to have working knowledge of
modern database languages like SQL and big data architectures like Hadoop. They need to be able to
determine when to use modern statistical, machine learning, and optimization methods like the LASSO, neural networks, and random forests.

But successful students won't just have a practitioner's knowledge of these tools, because these tools
will be replaced eventually. They also need the depth in their backgrounds to evaluate and adapt to
additional systems and methods as they become available.

Therein lies the challenge: the demands upon a data scientist are broad, whereas a Master's degree is
typically a structured, focused, deep exploration of a single field. There simply isn't enough time to
cover all that's necessary to develop a prospective student starting with a bachelor's degree in
Mathematics or Computer Science into a consummate data scientist in ten months.

Some topics will need to be sacrificed for the sake of brevity, but what? Different students will bring
diverse skills and affinities into such a degree program, and they will be good judges of what they
should focus on. However, a degree is essentially a set of requirements, which is another way of saying it's a set of guarantees. The better defined those requirements are, the clearer the guarantee of skill that the bearer of such a degree brings to future employers.

In the face of a program that will inevitably be stretched thin across many competencies, there are two
competing needs: the need for students to develop the subset of these competencies that maximizes their personal return, and the need for standardization across the program to make its value and quality obvious to all stakeholders. To reconcile these two needs, I envision a specialization system. Graduates from the ideal Master in Data Science program will also graduate with one of four specialties: visual analytics, databases, methodology, or algorithms.

Under this specialization program, all MDS candidates will be required to take a core of data scraping, imputation, R or SAS programming with an SQL component, modern regression such as GLMs, and scientific writing. This totals 15 graduate credits. The remaining 9 credits form a specialty.

Database experts would be most akin to software engineers. The courses for this specialty would
include one focused on the extract-transform-load paradigm and one focused on handling big data tools
like Hadoop. Graduates from this specialty would be expected to be able to implement automated tasks
for gathering, cleaning, and summarizing information from the web or some other digital sensor.

Methodologists would take applied statistical courses like design of experiments, sampling, dimension
reduction, time series, and spatial statistics. What separates this specialty from a Master's degree in
statistics is the lack of emphasis on proofs. Students in these courses need not understand why a
method works, only how to assess through diagnostics and checklists that it is working and when it is

Visual analytics specialists would take courses focused on user interfaces and communication,
including graphing and data cartography, dashboards, and additional writing work such as survey
design. Graduates from this specialty would be expected to demonstrate familiarity with popular
database interfaces like Jaspersoft and Tableau.

Algorithm experts would focus their additional coursework on new ways to find meaning from the
data deluge. Their corpus would include machine learning, numerical methods like quadrature and
simulated annealing, text processing concepts such as regular expressions and edit distance, image
processing, clustering, compression and information theory.
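
To make two of the text processing concepts named above concrete, here is a minimal Python sketch. It assumes that "edit distance" means the Levenshtein distance (insertions, deletions, and substitutions each cost 1), and the function names `tokenize` and `edit_distance` are hypothetical, chosen only for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase alphabetic words using a regular expression."""
    return re.findall(r"[a-z]+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (assumed definition).

    Uses the standard dynamic programming recurrence, keeping only the
    previous row of the table to save memory.
    """
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]
```

A student might use these together to flag likely misspellings, for example by tokenizing free-text survey answers and comparing each word against a list of expected terms.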

A graduate with skills in any one of these four specialties fits nicely under what we know as data
science today. This vision is a grand one, and far too large for a new master's program to take on, but
it's the endgame I have in mind for this program.

Reading Assignment - Designing Survey Questions

In one of the second-year service courses I taught this semester, some people were unable to do the participation assignment for non-academic reasons. This meant I needed an alternative assignment, which gave me a chance to field test the following reading assignment.

This is based on Chapter 8, "Designing Questions", of the book Successful Surveys - Research Methods and Practice by George Gray and Neil Guppy.
This assignment turned out to be pretty easy, so it could work as a warm-up for upcoming statistical writing classes. I didn't receive feedback from the students on how long it took, but I imagine it would be 2-4 hours for an undergrad. Even if you don't want to use the same reading chapter as I did, many of these questions could be answered from another reading source with minimal modification.

Q1. Give an example of a numerical (i.e. quantitative) open-ended question and a numerical closed-ended question.
Q2. Give an example of a non-numerical (e.g. nominal, text-based) open-ended question and a non-numerical closed-ended question.
Q3. In your OWN WORDS, give two advantages and two disadvantages of open-ended questions.
Q4. In your OWN WORDS, give two advantages and two disadvantages of closed-ended questions.
Q5. How do field-coded questions combine the features of both open- and closed-ended questions?
Q6. For what kind of surveys are open-ended questions more useful? When are they less useful?
Q7. What are five features that make for well-worded questions?
Q8. What is the name used for a survey question that asks about and focuses on two distinct things?
Q9. What is a Likert scale?
Q10. What are four things that all have important effects on how people respond to survey questions?