Featured post

Textbook: Writing for Statistics and Data Science

If you are looking for my textbook Writing for Statistics and Data Science here it is for free in the Open Educational Resource Commons. Wri...

Wednesday 14 February 2018

Freakazoid, a Repetition-Robust Symmetric Cipher.

This post describes the mathematical principles behind the Freakazoid cipher such that someone could recreate it with software if they wished. For a targeted guide to some of the terminology used here, please see my previous post discussing some fundamentals of encryption.

It is not meant to be a serious attempt at a cipher, and is presented as food for thought only. It was originally developed more than ten years ago as part of a semester of research as a math undergrad.

Freakazoid is a cipher based on DES, an old cipher that was used as an industry standard for a time, but with the added step that a new key is generated each block of data to be encrypted. The generation of this key is governed by a master key. For clarity, we call a key applied to a particular block of data the block key.

Tuesday 13 February 2018

Some Timely Fundamentals of Encryption


This post is in response to some questions about the mechanics of encryption that have been getting more frequent as cryptocurrency and related technologies become more well-known.

Wednesday 7 February 2018

Template for a technical report, with example rubric


A completed technical report might look like this:
1. Executive Summary
2. Introduction / Problem statement
3. Methods
4. Results
5. Conclusion / Discussion

 Executive Summary: The TL;DR

This is the LAST thing that you should write. This would be the tl;dr of the technical report. “tl;dr” stands for “too long; didn’t read”. More formally this is called the executive summary, which means ‘if this report was given to a major decision maker, whom has tons of things they need to know already, what would you like them to know from the report that can be reduced to 100 words or fewer.’

Here you should write the research question as shortly as you can, one main result, and the name of the main method used. Nothing from the discussion / conclusion section is needed here.


Introduction:

The introduction typically follows a close formula.

1. Describe the research problem or state the research questions that were posed. If you can, tell why this research problem is important. The explanation of importance doesn’t have to be too specific to the research problem. If you are working with data about a medical problem, mention that many people suffer this medical problem; in a research paper, this is a good opportunity to cite a well-known related paper that has found the scope of the problem for you.

If you don’t know why a problem is important and a quick literature search won’t tell you, leave the problem’s importance to a co-author whose expertise is more suited to this part. It’s much better to admit you don’t know something than to say something wrong.

2. Describe each section in the paper or report in very short detail. (e.g. “in the methods section, we describe the data cleaning and the regression tree method that we used. In the results section, we describe the goal scoring rate of different hockey players. In the discussion section, we follow up with a comparison of this method to an older, more traditional one.”)

Methods

The methods: What did you do to get these results?

If this were a field science, you would list the days and describe the conditions under which you went out into the field and gathered information (e.g. ‘we collected our samples on sunny days in the North Okanagan valley between June 10th and September 20th, 2015’). In a data science, you would instead describe the dataset that you used, its format and size, and key variables and features (e.g. ‘We gathered the data from NHL.com’s event-tracking database using the nhlscrapr package along with our own patch, The data we collected included each goal, shot, hit, penalty, and faceoff recorded in each regular season game from October 2012 to April 2017’)

This is where the bulk of your writing should be. About 50% of your report will be the methods section. You don’t need to explain the entire data cleaning process, but you should mention where the data came from, and the tools / software that were used. It’s also good practice to mention when the data was taken (especially in the case of news reports which may be updated, altered, or archived such that scraping may produce different results later).

If there were any judgement calls in your data cleaning process, such as…
- what was done about extreme and influential cases,
- how problematic variables were used,
- how tuning parameters for complex methods were selected, and
- how missing values were either filled in or explained away,
…these should be included as well.

In short, you don’t have to give everything away, but an expert with the same software and data access should be able to recreate what you did.

A methods section serves two purposes: first to give legitimacy to your results. If you show results without explaining how you got them, a reader might assume that the results were invented or made up. With a methods section, the reader should be able to see a logical path between the data and the results.

After the data preparation is explained, describe the model you selected or the process you used to select the model. If you just did linear regression, say that. If you used a random forest, or the LASSO, or stepwise regression, say that instead.

Normally, you only need to include the final method that you decided upon. However, there is a good chance that the method you used wasn’t the only method that you tried. In a research paper, you wouldn’t necessarily mention these ‘dead ends’ because paper length is limited by the journal. In a technical report (or a thesis) these other approaches are useful to help you justify your choice and that alternatives were considered. You can explain why these rejected methods didn’t work or what about the results they produced was bad. Don’t overdo these dead-end explanations. The reader is much more interested in what you did and what worked instead of what didn’t work, typically.

Example: “After an exploratory analysis, we tried to classify events using random forests, dimension reduction, and neural nets. We decided to further pursue neural nets because they produce models with much lower out-of-bag errors than other approaches.”

Results

It’s easiest to write the results first, even though they don’t appear first. Any tables of figures you want to show, make these as soon as the analysis work is done. Talk about your results a little. Explain the importance of any tables and figures; why are they there?

Mention the general trend (e.g. ‘there is a negative, non-linear trend between playing time per game and shots against goal’), and any notable observations (‘however, the New Jersey Devils break this trend’)

You don’t need to write much here. The charts should explain themselves.


Discussion  / Conclusion:

In a technical report, this is where you take the results and give them meaning in the context of the research questions that were in the introduction. You can also quickly summarize what you did.
In a journal paper or a thesis, this section might also include future research questions that could be answered with more data or by a different analysis. A technical report should be more self-contained, and allusions to the further work are not required.

In every case, no new information about the project should in introduced in the conclusion. If you have an interesting finding, it should be in the results. If that interesting finding doesn’t fit with the rest of the results, a new subsection for it can always be made, but keep it out of the discussion section.

Remember, when giving context to the results, don’t reach beyond your expertise. If the data is genetic, and you are not a geneticist or biologist, do not make conclusions about the importance of a gene. Often statistical publications are co-authored with subject experts; let those experts write about their topics and stick to the data analysis.


Example Rubric: Total out of 100
Length
3-6 full pages
10


7 pages
8


8 pages
6


More than 8
0


2.5 pages
8


2 pages
4


Less than 2 full
0



Possible 10
/10




Grammar
Start with 10, any OBVIOUS grammar or spelling mistakes, reduce this by 2. Minimum 0/10




Possible 10
/10




Executive Summary
Name of main method included
3


Primary finding described
7



Possible 10
/10




Introduction
Describes the research problem
6


Makes a case for the its importance
4



Possible 10
/10




Methods
Describes the data used
5


Describes the data preparation and/or data cleaning
5


Describes any decisions / judgement
5


Describes method used
10


Justifies choice of method
5



Possible 30
/30




Results
Summarizing data through text, table, or a figure
10


Description of a general trend
5



Possible 15
/15




Conclusion
Ties results back to research question
10


Does NOT include any new information that would be better in the results or methods sections.
5



Possible 15
/15





TOTAL

/100