Featured post

Textbook: Writing for Statistics and Data Science

If you are looking for my textbook Writing for Statistics and Data Science, here it is for free in the Open Educational Resource Commons. Wri...

Tuesday 16 April 2019

Presidents' Trophy - A Curse by Design


There are lots of ways the NHL rewards failure and punishes excellence, like the player draft, ever-shrinking salary caps, and the half-win award for participating in overtime. Even the way playoff pairings are decided carries a perverse incentive.


There are three rewards for doing well in the NHL regular season:
1. Going to the playoffs,
2. A favorable first-round pairing in said playoffs,
3. Home team advantage for playoff games more often.

Here I argue that the first-round pairings are not as favorable as they could be. 

Tuesday 9 April 2019

Natural Language Processing in R: Edit Distance

These are the notes for the second lecture in the unit on text processing. Some useful ideas, like exact string matching and the definitions of characters and strings, are covered in the notes for Natural Language Processing in R: Strings and Regular Expressions.

Edit distance, also called Levenshtein distance, is the minimum number of single-character edits (insertions, deletions, and substitutions) needed to transform one string into another. The R function adist() is used to find the edit distance.

adist("exactly the same","exactly the same") # edit distance 0 
adist("exactly the same","totally different") # edit distance 14
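
As a small sketch of how the individual edit types add up, the example below uses adist() from base R on a few word pairs. The words are illustrative choices, not taken from the lecture notes.

```r
# Each edit type on its own costs one edit
d_sub <- adist("cat", "cot")    # one substitution (a -> o)
d_ins <- adist("cat", "cart")   # one insertion (r)
d_del <- adist("cart", "cat")   # one deletion (r)

# A classic mixed case: two substitutions plus one insertion
d_mix <- adist("kitten", "sitting")

# Given a single character vector, adist() returns the full pairwise matrix
words <- c("color", "colour", "collar")
d_matrix <- adist(words)

# Case can be ignored so that "Cat" and "cat" count as identical
d_case <- adist("Cat", "cat", ignore.case = TRUE)
```

Note that adist() always returns a matrix, even for a single pair of strings, so a single distance is pulled out with something like d_mix[1, 1].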


Natural Language Processing in R: Strings and Regular Expressions

In this post, I go through a lesson in natural language processing (NLP) in R. Specifically, it covers how strings operate in R, how regular expressions work in the stringr package by Hadley Wickham, and some exercises. Included with the exercises is a list of expected hang-ups, as well as an R function that can quickly check the solutions.

This lesson is designed for a 1.5-2 hour class for senior undergrads.

Contents:
  • Strings in R
    • Strings can be stored and manipulated in a vector
    • Strings are not factors
    • Escape sequences
    • The str_match() function
  • Regular expressions in R
    • Period . means 'any'
    • Vertical line | means 'or'
    • +, *, and {} define repeats
    • ^ and $ mean 'beginning with' and 'ending with'
    • [] is a shortcut for 'or'
    • hyphens in []
    • example: building a regular expression for phone numbers 
  • Exercises
    • Detect e-mail addresses
    • Detect a/an errors
    • Detect Canadian postal codes
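
To give a flavour of the regular-expression pieces listed above, here is a minimal sketch. It uses base R's grepl() rather than stringr so that it runs without extra packages; stringr's str_detect(string, pattern) plays the same role. The patterns and test strings are my own illustrations, not the ones from the lesson, and the phone pattern only covers one simple North American layout.

```r
# '.' matches any character and '+' means 'one or more'
any_middle <- grepl("^a.+z$", c("abcz", "az"))

# '|' means 'or'; '^' and '$' anchor the start and end of the string
cat_or_dog <- grepl("^(cat|dog)$", c("cat", "dog", "catdog"))

# '[]' is a shortcut for 'or', a hyphen inside it defines a range,
# and '{}' sets the number of repeats.
# Assumed layout: 3 digits, separator, 3 digits, separator, 4 digits.
# Inside [], a leading '-' and a '.' are both taken literally.
phone_pattern <- "^[0-9]{3}[- .][0-9]{3}[- .][0-9]{4}$"

candidates <- c("604-555-1234", "604 555 1234", "6045551234", "604-555-12345")
is_phone <- grepl(phone_pattern, candidates)
```

Here the first two candidates match, the third fails because it has no separators, and the fourth fails because of the extra trailing digit.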

Sunday 7 April 2019

Writing R documentation, simplified


A massive part of statistical software development is the documentation. Good documentation is more than just a help file: it serves as commentary on how the software works, includes use cases, and cites any relevant sources.

One cool thing about R documentation is that it uses a system that allows it to be put into a variety of different formats while only needing to be written once.
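
The write-once idea can be sketched as follows, assuming the system in question is R's Rd format as handled by the base tools package. The function name add_one and its help text are hypothetical, made up purely for illustration.

```r
library(tools)

# A tiny Rd source for a hypothetical function add_one()
rd_text <- c(
  "\\name{add_one}",
  "\\alias{add_one}",
  "\\title{Add one to a number}",
  "\\description{Returns its input plus one.}",
  "\\usage{add_one(x)}",
  "\\arguments{\\item{x}{A numeric vector.}}"
)

# Write the source once...
rd_file <- tempfile(fileext = ".Rd")
writeLines(rd_text, rd_file)
rd <- parse_Rd(rd_file)

# ...then render the same parsed object into two different formats
txt_file <- tempfile(fileext = ".txt")
html_file <- tempfile(fileext = ".html")
Rd2txt(rd, out = txt_file)
Rd2HTML(rd, out = html_file)

txt_lines <- readLines(txt_file)
html_lines <- readLines(html_file)
```

The same description sentence ends up in both the plain-text and the HTML output, without being written twice.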

Monday 1 April 2019

Bingo analysis, a tutorial in R



I'm toying with the idea of writing a book about statistical analyses of classic games. The target audience would be mathematically interested laypeople, much like Jeffrey Rosenthal's book Struck by Lightning ( https://www.amazon.ca/Struck-Lightning-Jeffrey-S-Rosenthal/dp/0006394957 ).

The twist would be that each chapter would contain step-by-step R code or Python code so that the reader could do the same analysis and make changes based on their own questions. Material would be like this post on Bingo, as well as my previous post on Snakes and Ladders ( https://www.stats-et-al.com/2017/11/snakes-and-ladders-and-transition.html ).

There would also be some work on chess variants, Othello, poker, and possibly Go, mahjong, and Pente. Tied to each analysis could be light lessons on statistics. This Bingo analysis involves Monte Carlo style simulation, as well as notes on computing expected values, CDFs and PDFs.
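
As a rough sketch of what Monte Carlo simulation looks like here, the example below is a deliberately simplified stand-in, not the analysis from the full post. It assumes 75 balls and asks how many calls it takes until five particular numbers (one hypothetical row of a card) have all been called. The row numbers, seed, and simulation count are arbitrary choices.

```r
set.seed(2019)

n_balls <- 75
row_numbers <- c(3, 18, 40, 52, 71)  # hypothetical row on a card

# One simulated game: shuffle the balls, then find the call on which
# the last of the five row numbers comes up
calls_to_complete <- function() {
  call_order <- sample(n_balls)
  max(match(row_numbers, call_order))
}

sims <- replicate(10000, calls_to_complete())

# Monte Carlo estimate of the expected number of calls
expected_calls <- mean(sims)

# Empirical probability mass function (the discrete analogue of a PDF)
pmf <- table(factor(sims, levels = 1:n_balls)) / length(sims)

# Empirical CDF: probability the row is complete by call k
cdf <- cumsum(pmf)
```

For this simplified problem the exact expected value is known to be 5 * 76 / 6, or about 63.3 calls, so the simulated estimate can be checked against it. A full Bingo analysis would need to track every row, column, and diagonal on the card.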