Saturday, 29 April 2017

A Long-Term Vision for a Master's Degree Program in Data Science

I was recently asked to write a statement of vision for a potential one-year coursework master's program. A colleague was kind enough to look at it for me; he said it was 'a bit too bombastic'.

This statement has already been submitted to its intended destination.





----------

Successful data scientists need to be able to communicate not just verbally and in writing, but also
visually by way of graphs, dashboards, and animations. They need to have working knowledge of
modern database languages like SQL and big data architectures like Hadoop. They need to be able to
determine when to use modern statistical, machine learning, and optimization methods like the LASSO, neural networks, and random forests.

But successful students won't just have a practitioner's knowledge of these tools, because these tools
will be replaced eventually. They also need the depth in their backgrounds to evaluate and adapt to
additional systems and methods as they become available.

Therein lies the challenge: the demands upon a data scientist are broad, whereas a Master's degree is
typically a structured, focused, deep exploration of a single field. There simply isn't enough time to
cover all that's necessary to develop a prospective student starting with a bachelor's degree in
Mathematics or Computer Science into a consummate data scientist in ten months.

Some topics will need to be sacrificed for the sake of brevity, but which? Different students will bring
diverse skills and affinities into such a degree program, and they will be good judges of what they
should focus on. However, a degree is essentially a set of requirements, which is another way of saying it's a set of guarantees. The better defined those requirements are, the clearer the guarantee of skill that the bearer of such a degree brings to future employers.

In the face of a program that will inevitably be stretched thin across many competencies, there are two
competing needs: the need for students to develop the subset of these competencies that maximizes their personal return, and the need for standardization across the program to make its value and quality obvious to all stakeholders. To reconcile these two needs, I envision a specialization system. Graduates from the ideal Master in Data Science program will also graduate with one of four specialties: visual analytics, databases, methodology, or algorithms.

Under this specialization program, all MDS candidates will be required to take a core of data scraping, imputation, R or SAS programming with an SQL component, modern regression such as GLMs, and scientific writing. This totals 15 graduate credits. The remaining 9 credits form a specialty.

Database experts would be most akin to software engineers. The courses for this specialty would
include one focused on the extract-transform-load paradigm and one focused on big data tools
like Hadoop. Graduates from this specialty would be expected to be able to implement automated tasks
for gathering, cleaning, and summarizing information from the web or from digital sensors.

Methodologists would take applied statistical courses like design of experiments, sampling, dimension
reduction, time series, and spatial statistics. What separates this specialty from a Master's degree in
statistics is the lack of emphasis on proofs. Students in these courses need not understand why a
method works, only how to assess through diagnostics and checklists that it is working and when it is
appropriate.

Visual analytics specialists would take courses focused on user interfaces and communication,
including graphing and data cartography, dashboards, and additional writing work such as survey
design. Graduates from this specialty would be expected to demonstrate familiarity with popular
database interfaces like Jaspersoft and Tableau.

Algorithm experts would focus their additional coursework on new ways to find meaning from the
data deluge. Their corpus would include machine learning, numerical methods like quadrature and
simulated annealing, text processing concepts such as regular expressions and edit distance, image
processing, clustering, compression and information theory.

A graduate with skills in any one of these four specialties fits nicely under what we know as data
science today. This vision is a grand one, and far too large for a new master's program to take on, but
it's the endgame I have in mind for this program.