Statistics et al.: 2019

Saturday, 9 November 2019

Basketball Data Science with Applications in R - An Advance Review

My overall impression of "Basketball Data Science" is that it's exactly the sort of book I would recommend to an instructor or able student of statistics in sport. Most of my criticisms are because it's not the book I imagine I would have written, not because the book isn't good.

Postmortems from 'Game Developer', a book review

"Postmortems from 'Game Developer'", edited by Austin Grossman is a collection of after-the-fact analyses of popular video games that had been recently developed and sold. The analyses are all written by senior members of each game's development team, and usually the game's head creator. "Postmortems" was published in 2003, so the games being analyzed include Diablo 2, Age of Empires, System Shock 2, and Black and White.

In 2016 I purchased my copy, which had a suggested Canadian price of $42, from a thrift store. It had several discount stickers on it, one of which listed the book at $0.84, a 98% discount.

Offsetting the carbon emissions of the blog, and then some.

Worrying about climate change is wearing down my sanity, and I know I'm not alone. I wanted to find a way to let others reduce the amount of CO2 and methane in the air that didn't cost them money. One way is to buy carbon offsets with money from advertisements, like the ones that roughly 10% of you see at the top and side of this blog.

Is that futile?

How much carbon is emitted by a visit to a website like this? How does that compare to the cost of offsets? This is going to be a rabbit-hole of citations and unit conversion, so buckle in.

Tips to successfully web scrape with a macro

Recently, I updated the cricket simulator that I made in grad school, which entailed gathering four years of new T20I, IPL, and ODI cricket data. That's about 1000 matches, and the website ESPNcricinfo has been dramatically updated since I first scraped it. For that matter, so have the tools in R for scraping.

However, for one part, the play-by-play commentaries, nothing was working, and I ended up relying on recording and repeating mouse-and-keyboard macros. It's crude, but the loading-as-scrolling mechanic was just too hard to deal with programmatically, even with the otherwise very powerful RSelenium.

Using macros to scrape pages is a trial-and-error process, but with the following principles, you can drastically reduce the number of trials it takes to get it right.

Alternatives to cryptocurrency for seasteads

Frequently, cryptocurrency and blockchain technology (crypto for short) are cited as a necessity for a functioning seastead. In the long term, when or if seasteads are large enough and economically important enough to fight for sovereignty, having your very own currency makes sense.

However, cryptocurrency works best in places with a robust internet infrastructure to back up digital transactions, and where there is a critical mass of people in an area willing to trade their goods and services for the currency. In the early days of floating hamlets, though, better alternatives to crypto exist.

Making Crossword Entries from Jeopardy! Before and After Clues

Before and After clues in Jeopardy! are clues pointing to two answers in which the last part of one answer is also the first part of the second answer. For example, "Supernatural kids' cartoon meets Star Wars prequel" could be a clue to both "Danny Phantom" and "The Phantom Menace", which would be shortened to "Danny Phantom Menace".

These are a weak spot for me in Jeopardy!, so I tried to make some more crosswords using only the 'before and after' clues from Jeopardy! as found in the J-archive. It worked, sort of.
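As a rough illustration of how two answers collapse into one entry, here is a minimal R sketch. It is not the code from the post: merge_before_after() is a hypothetical helper, and it assumes that answers are split on spaces and that a leading article on the second answer is dropped, as in the "Danny Phantom Menace" example.

```r
# Hypothetical helper (not from the post): join two answers where the last
# word(s) of the first match the first word(s) of the second.
merge_before_after <- function(first, second) {
  # Assumption: a leading article on the second answer is dropped
  second <- sub("^(the|a|an) ", "", second, ignore.case = TRUE)
  w1 <- strsplit(first, " ")[[1]]
  w2 <- strsplit(second, " ")[[1]]
  # Try the longest possible overlap first, then shorter ones
  for (k in rev(seq_len(min(length(w1), length(w2))))) {
    if (identical(tolower(tail(w1, k)), tolower(head(w2, k)))) {
      return(paste(c(w1, w2[-seq_len(k)]), collapse = " "))
    }
  }
  NA_character_  # no shared words at the join
}

merge_before_after("Danny Phantom", "The Phantom Menace")
```

Pairs with no shared word at the join return NA, which makes it easy to filter candidate pairs before building a crossword.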

Book review: Big Data by Timandra Harkness

I picked up Big Data by Timandra Harkness solely on the testimonials of Hannah Fry and Matt Parker on the front and back covers. The 2017 printing that I read is a 300-odd page general interest book about recent advances in big data.

"Big data" starts off pretty boilerplate for the topic – with a lot of definitions about what makes data "big"; volume, variety, velocity, and the like. It also gives some historical context about the growth of data over time, through early censuses, primitive computers, to today. The rest of this book is the result of interviews across the world with people working on different big data projects.

Seasteading Economic Opportunities Overview

This is an attempt to start a conversation about different means of making a living while seasteading. From chatter online, I've heard many of the same intended sources of income: bitcoin and other crypto mining, playing the stock market, tourism, and freelance software engineering.

These strategies won't work. Not because they are bad ways to make a living, but because the supply will quickly outstrip the demand, no matter how good each worker is at programming, or how amazing each oceanic hotel is. Somebody has to grow the food for all these service and knowledge workers.

In short, we need economic diversity.

Below are some seasteading-based economic opportunities, arranged by their distinct advantages.

Replication report: Multilevel-Linear-Models using SAS and SPSS for repeated measures designs

The following is a report on the reproduction of the statistical work in the paper "Differences of Type I error rates for ANOVA and Multilevel-Linear-Models using SAS and SPSS for repeated measures designs" by Nicolas Haverkamp and André Beauducel at the University of Bonn.

Reversi in R - Part 2: Graphics and Custom Boards

In this post, I finish the Reversi / Othello game in R by improving the graphics, adding the ability to save and load boards, and fixing bugs. Also, many more boards have been added and tested, including those with unusual shapes, three or more players, and walls that can make the board into unusual shapes or even break it in half.

Reading questions: Struck by Lightning

The book Struck by Lightning, by Jeffrey Rosenthal, hits that balance of scientific correctness and approachability just right for a general audience book on probability. It’s been in print for 14 years now, and was a Canadian bestseller, so there’s nothing new that would come from a traditional review and therefore I won’t write one.

Instead, two points:

1. It should be required K-12 or 100-level math/stats reading.

Reversi in R - Part 1: Bare Bones

In this post, I showcase a bare-bones point-and-click implementation of the classic board game Reversi (also called Othello*) in the R programming language. R is typically used for more serious, statistical endeavors, but it works reasonably well for more playful projects. Building a classic game like this is an excellent high-school level introduction to programming, as well as a good basis for building and testing game AI.

I read this: New Rules for Classic Games

New Rules for Classic Games, by R. Wayne Schmittberger, written in 1992, is exactly what it sounds like. "New Rules" contains possible amendments to rules for Risk, Monopoly, Poker, Bridge, Scrabble, Reversi/Othello*, Shogi, Go, and of course Chess.

Annual Report to Stakeholders 2018-19

Every year in grad school I had to write a report on my research and academic progress. I found it a useful exercise so I've continued to do so as a faculty member and post-doc.

Summary:

Professionally, this year was a struggle just to keep my head above water. I expect next year to be more productive in general, as well as more research oriented.

Replication Report - Informative Priors and Bayesian Updating

The following is a report on the reproduction of the statistical work in the paper "The use of informative priors and Bayesian updating: implications for behavioral research" by Charlotte O. Brand et al.

The original paper was accepted for publication by Meta-Psychology, https://open.lnu.se/index.php/metapsychology, a journal focused on methodology and reproductions of existing work. This report continues my attempt to establish a standard for replication reports according to the Psych Data Standards found here https://github.com/psych-ds/psych-DS

Package Spotlight: anim.plots

The package anim.plots behaves like a sort of user-friendly shell on top of animation that makes animations of some of the most common types of plots in base R in a more intuitive fashion than animation.

This package depends on two other important packages:

- magick, which is an R implementation of ImageMagick, which itself is software used to create animated gifs from still images.

- animation, which is an R library that can be used to create animations from any collection of plots.

President's Trophy - A Curse by Design

There are lots of ways the NHL rewards failure and punishes excellence, like the player draft, ever shrinking salary caps, and the half-win award for participating in overtime, but even the way in which playoff pairings are decided has a perverse incentive.

There are three rewards to doing well in the NHL regular season:

1. Going to the playoffs,

2. A favorable first round pairing in said playoffs,

3. Home team advantage for playoff games more often.

Here I argue that the first-round pairings are not as favorable as they could be.

Natural Language Processing in R: Edit Distance

These are the notes for the second lecture in the unit on text processing. Some useful ideas like exact string matching and the definitions of characters and strings are covered in the notes of Natural Language Processing in R: Strings and Regular Expressions.

Edit distance, also called Levenshtein distance, is a measure of the number of single-character edits (insertions, deletions, and substitutions) that would need to be made to transform one string into another. The R function adist() is used to find the edit distance.

adist("exactly the same","exactly the same") # edit distance 0 
adist("exactly the same","totally different") # edit distance 14
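To make the counting concrete, here is a small sketch using the textbook pair "kitten" and "sitting", which I've added for illustration and which does not come from the lecture notes. Each step is one single-character edit, and the second call shows that adist() returns a matrix of pairwise distances when given a vector.

```r
# kitten -> sitting in three single-character edits:
#   kitten -> sitten   (substitute k with s)
#   sitten -> sittin   (substitute e with i)
#   sittin -> sitting  (insert g)
d <- adist("kitten", "sitting")
d[1, 1]  # 3

# Given one character vector, adist() returns every pairwise distance
words <- c("color", "colour", "collar")
dist_matrix <- adist(words)
dimnames(dist_matrix) <- list(words, words)
dist_matrix
```

A distance matrix like this is one possible starting point for fuzzy matching, such as finding the closest dictionary word to a misspelling.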

Natural Language Processing in R: Strings and Regular Expressions.

In this post, I go through a lesson in natural language processing (NLP), in R. Specifically, it covers how strings operate in R, how regular expressions work in the stringr package by Hadley Wickham, and some exercises. Included with the exercises is a list of expected hang-ups, as well as an R function that can quickly check the solutions.

This lesson is designed for a 1.5-2 hour class for senior undergrads.

Contents:

Strings in R

Strings can be stored and manipulated in a vector
Strings are not factors
Escape sequences
The str_match() function

Regular expressions in R

Period . means 'any'
Vertical line | means 'or'
+, *, and {} define repeats
^ and $ mean 'beginning with' and 'ending with'
[] is a shortcut for 'or'
hyphens in []
example: building a regular expression for phone numbers

Exercises

Detect e-mail addresses
Detect a/an errors
Detect Canadian postal codes
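As a taste of how the pieces in the outline combine, here is a minimal sketch of a phone number pattern. It is my own illustration, not the lesson's solution. It assumes a North American format with an optional area code, and it uses base R's grepl() so that it runs without extra packages; stringr's str_detect() plays the same role in the lesson's package.

```r
# Assumed format: an optional area code, written as "(306) " or "306-",
# followed by three digits, a hyphen, and four digits.
#   ^ and $   anchor the match to the whole string
#   [0-9]     uses a hyphen inside [] to mean 'any digit'
#   {3}, {4}  fix the number of repeats
#   |         separates the two ways of writing the area code
#   ?         makes the preceding item optional
phone_pattern <- "^(\\([0-9]{3}\\) ?|[0-9]{3}-)?[0-9]{3}-[0-9]{4}$"

candidates <- c("306-555-0199", "(306) 555-0199", "555-0199",
                "555-01999", "phone me")
is_phone <- grepl(phone_pattern, candidates)
data.frame(candidates, is_phone)
```

The double backslashes are there because R first processes the string's own escape sequences, so "\\(" reaches the regular expression engine as a literal bracket.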

Writing R documentation, simplified

A massive part of statistical software development is the documentation. Good documentation is more than just a help file; it serves as commentary on how the software works, includes use cases, and cites any relevant sources.

One cool thing about R documentation is that it uses a system that allows it to be put into a variety of different formats while only needing to be written once.

Bingo analysis, a tutorial in R

I'm toying with the idea of writing a book about statistical analyses of classic games. The target audience would be mathematically interested laypeople, much like Jeffrey Rosenthal's book Struck by Lightning ( https://www.amazon.ca/Struck-Lightning-Jeffrey-S-Rosenthal/dp/0006394957 ).

The twist would be that each chapter would contain step-by-step R code or Python code so that the reader could do the same analysis and make changes based on their own questions. Material would be like this post on Bingo, as well as my previous post on Snakes and Ladders ( https://www.stats-et-al.com/2017/11/snakes-and-ladders-and-transition.html ).

There would also be some work on chess variants, othello, poker, and possibly go, mahjong, and pente. Tied to each analysis could be light lessons on statistics. This Bingo analysis involves Monte Carlo style simulation, as well as notes on computing expected values, CDFs and PDFs.
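For a flavour of what a Monte Carlo analysis of Bingo can look like, here is a minimal sketch. It is not the code from the full post. It assumes standard 75-ball Bingo with a single card, a free centre square, and a win on any completed row, column, or diagonal; make_card() and calls_to_bingo() are hypothetical helper names.

```r
set.seed(2019)  # arbitrary seed so the run is repeatable

# Build one card: column j holds 5 numbers drawn from its own block of 15
make_card <- function() {
  card <- sapply(0:4, function(j) sample(15 * j + 1:15, 5))
  card[3, 3] <- NA  # free centre square
  card
}

# Number of calls until the card completes a row, column, or diagonal
calls_to_bingo <- function(card) {
  call_order <- sample(75)  # random order in which numbers are called
  # marked_at[i, j] is the call on which that square gets marked
  marked_at <- matrix(match(card, call_order), 5, 5)
  marked_at[3, 3] <- 0  # the free square is marked from the start
  line_done <- c(apply(marked_at, 1, max),     # rows
                 apply(marked_at, 2, max),     # columns
                 max(diag(marked_at)),         # main diagonal
                 max(diag(marked_at[, 5:1])))  # anti-diagonal
  min(line_done)  # the first line to finish wins
}

n_calls <- replicate(5000, calls_to_bingo(make_card()))

mean(n_calls)  # Monte Carlo estimate of the expected number of calls
pmf <- table(factor(n_calls, levels = 1:75)) / length(n_calls)  # estimated probabilities
cdf <- cumsum(pmf)  # estimated chance of a bingo within k calls
```

Because the number of calls is discrete, the estimated distribution here is a probability mass function. More replications tighten the estimates at the cost of run time.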

Dataset - The Giant Marmots of Moscow

Stat 403/640/890 Analysis Assignment 3: Polluted Giant Marmots

Due Wednesday, April 3rd

Drop off in the dropbox by the stats workshop, or hand it in during class.

For this assignment, use the Marmots_Real.csv dataset.

Main goal: The giant marmots of Moscow have a pollution problem. Find a model to predict the pollutant concentration (mg per kg) in the local population without resorting to measuring it directly. (It turns out that measuring this pollutant requires some invasive measures like looking at bone marrow).

The dataset Marmots_Real.csv has the data from 60 such marmots, including many variables that are easier to measure:

Variable Name	Type	Description
Species	Categorical, Unordered	One of five species of giant marmot
Region	Categorical, Unordered	One of five regions around Moscow where the subject is captured
Age	Numerical, Continuous	Age in years
Pos_x	Numerical, Continuous	Longitude, recoded to (0,1000), of capture
Pos_y	Numerical, Continuous	Latitude, recoded to (0,1000), of capture
Long_cm	Numerical, Continuous	Length nose to tail in cm
Wide_cm	Numerical, Continuous	Width between front paws, outstretched
Sex	Binary	M or F
Lesions	Numerical, Count	Number of skin lesions (cuts, open sores) found upon capture
Injured	Binary	0 or 1, 1 if substantial injury was observed upon capture.
Teeth_Condition	Categorical, Ordered	Condition of teeth upon capture, listed as Very Bad, Bad, Average, or Good.
Weight	Numerical, Continuous	Mass of subject in 100g
Antibody	Numerical, Continuous	Count of CD4 antibody in blood per mL
Pollutant	Numerical, Continuous	mg/kg of selenium found in bone marrow

There are no sampling weights. There is no missing data. There should be little to no convergence or computational issues with this data.

Assignment parts:

1) Build at least three models for pollutant and compare them (e.g. r-squared, general parsimony). Be sure to try interactions and polynomial terms.

Select one to be your ‘model to beat’.

2) Check the diagnostics of your model to beat. Specifically, normality of residuals, influential outliers, and the Variance Inflation Factor. Comment.

3) Try a transformation of the response in your model to beat, and see if you can improve the r-squared.

4) Try a PCA-based model and see if it comes close to your model to beat.

5) Take your ‘model to beat’ and add some terms to it. Call this the ‘full model’, and use that as a basis for model selection using stepwise and the AIC criterion. Is the stepwise-produced model better (r-squared, AIC) than your ‘model to beat’?

6) If you haven’t already, try a random effect of something appropriate, and see if it beats the AIC of the stepwise model. Use the AIC() function to see the AIC of most models.

Useful sample code:

######## Preamble / Setup

## Load the .csv file into R. Store it as 'dat'

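dat = read.csv("marmots_real.csv")

## The variable table capitalizes names (e.g. Region) but the code below uses lowercase.

## R is case-sensitive, so force lowercase names (harmless if they already are).

names(dat) = tolower(names(dat))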

dat$region = as.factor(dat$region)

library(car) # for vif()

library(MASS) # for stepAIC() and boxcox()

library(ks) # for kde()

library(lme4) # lmer and glmer

##### Try some models, with interactions

### Saturated. Not enough DoF

mod = lm(antibody ~ species*region*age*weight + long_cm, data=dat)

summary(mod)

### Another possibility, enough DoF, but efficient?

mod = lm(antibody ~ species + region + age*weight*long_cm, data=dat)

summary(mod)

vif(mod)

plot(mod)

AIC(mod)

### With polynomials

mod = lm(pollutant ~ age + long_cm + wide_cm + I(sqrt(age)) + I((long_cm*wide_cm)^3), data=dat)

summary(mod) ## High r-sq and little significance? How?

vif(mod) ## Oh, that's how.

#### Model selection,

mod_full = lm(antibody ~ species + region + age*weight*long_cm + I(log(wide_cm)) + lesions, data=dat)

### Stepwise selection based on AIC

stepAIC(mod_full)

### What if we do a BIC penalty

## Try without trace=FALSE so we can see what's going on.

stepAIC(mod_full, k=log(nrow(dat)))

######### Transformations

### Start with the classic transforms

### Square root and log of the response, same predictors as before

mod = lm(sqrt(pollutant) ~ species + region + age*weight*long_cm, data=dat)

summary(mod)

mod = lm(log(pollutant) ~ species + region + age*weight*long_cm, data=dat)

summary(mod)

### Box-Cox to find the range of the best transformations

boxcox(pollutant ~ species + region + age*weight*long_cm, data = dat,

lambda = seq(-2, 3, length = 30))

boxcox(antibody ~ species + region + age*weight*long_cm, data = dat,

lambda = seq(-2, 3, length = 30))

### Anything above the 95% line is perfectly fine. Anything close is probably fine too.

### Reminder:

### Lambda = -1 is 1/x (inverse) transform

### Lambda = 0 is log transform

### Lambda = 1/2 is sqrt transform

### Lambda = 1 is no transform

### Lambda = 2 is square transform

#################

### MANOVA

### First, Are the two responses related?

cor(dat$antibody, dat$pollutant)

plot(dat$antibody, dat$pollutant)

### Start with the simple ANOVAs

mod_anti = lm(antibody ~ species + region + age*weight*lesions, data=dat)

mod_poll = lm(pollutant ~ species + region + age*weight*lesions, data=dat)

aov_anti = anova(mod_anti)

aov_anti

summary(aov_anti)

aov_poll = anova(mod_poll)

aov_poll

summary(aov_poll)

### Now try the multiple ANOVA

aov_mult = manova(cbind(antibody, pollutant) ~ species + region + age*weight*lesions, data=dat)

aov_mult

summary(aov_mult) ### Your job: Make a model that balances simplicity with fit.

## Residual standard errors: Lower = better fit

###################

# PCA

### convert the relevant categorical variables

dat$teeth_num = as.numeric(factor(x = dat$teeth_condition, levels=c("Very Bad","Bad","Average","Good")))

dat$sex_num = as.numeric(factor(x = dat$sex, levels=c("F","M")))

PCA_all = prcomp( ~ age + weight + lesions + long_cm + wide_cm + injured + teeth_num + sex_num,

data = dat,

scale. = TRUE)

summary(PCA_all)

plot(PCA_all, type="lines")

### Add the Principal components to the marmots dataset

dat = cbind(dat, PCA_all$x)

head(dat)

### Try a few models of the responses using the PCAs

mod_PCA1 = lm(antibody ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6, data=dat)

summary(mod_PCA1)

mod_PCA2 = lm(antibody ~ PC1 * PC2 * PC3 , data=dat)

summary(mod_PCA2)

mod_PCA3 = lm(antibody ~ PC1 + PC2 + PC3 , data=dat)

summary(mod_PCA3)

mod_PCA4 = lmer(pollutant ~ PC1 + PC2 + PC3 + (1|region), data=dat)

summary(mod_PCA4) ### Why non-zero correlations? Adjustments for region were made

### Fixed vs Random

dat$region = as.factor(dat$region)

summary(lm(pollutant ~ region, data=dat))

summary(lmer(pollutant ~ (1|region), data=dat))

summary(lmer(pollutant ~ PC1 + PC2 + (age|region), data=dat))

summary(lmer(pollutant ~ age + (1|region), data=dat))$logLik ### Higher LogLik is better

summary(lmer(pollutant ~ age + (1|region), data=dat))$AICtab ### Lower REML (AIC calculated by REML) is better

### Compare the $AICtab value to the result from stepAIC

Saturday, 23 March 2019

What makes a good data dictionary?

A data dictionary is a guide, external to the dataset in question, that explains what each variable is in a human-readable format. In R, the programming equivalent of a data dictionary is the result of the str() function, which will show the first few values of each variable, the format (e.g. numerical, string, factor), and other important information (e.g. the first few levels of the factor).

A data dictionary may include software-specific features, but it should still be of value to anyone using the dataset, regardless of the software they are using.

Other than that, what makes a good data dictionary?
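To show the difference between the two, here is a minimal sketch on a made-up three-row dataset (the variables and descriptions are invented for illustration). str() gives the software view, while a small hand-annotated table saved as a plain .csv gives a dictionary that is readable outside of R.

```r
# Toy data, invented for illustration
dat <- data.frame(
  age     = c(2.5, 4.1, 3.3),
  sex     = factor(c("F", "M", "F")),
  lesions = c(0L, 2L, 1L)
)

# The software view: types and the first few values of each variable
str(dat)

# A portable dictionary: one row per variable, with descriptions written by hand
dictionary <- data.frame(
  variable    = names(dat),
  type        = unname(sapply(dat, function(x) class(x)[1])),
  description = c("Age in years",
                  "Sex, recorded as M or F",
                  "Number of skin lesions found upon capture"),
  stringsAsFactors = FALSE
)
dictionary

# Written to a temporary folder here; in practice it would sit next to the dataset
write.csv(dictionary, file.path(tempdir(), "data_dictionary.csv"), row.names = FALSE)
```

The type column can be generated automatically, but the description column cannot, and that is the part the person without R needs most.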

Reading Assignments - Split-Plot Design, Magnitude-Based Inference

This semester, I'm teaching a new (to me) course that's heavy into design of experiments and biostatistics, which means I needed some new reading assignments. First, a survey of applications of split-plot designs for fisheries. Next, a seminal paper on magnitude-based inference, written for physiologists. Non-paywalled links to the papers included.

Alternatives to the P value

Here are some things you can find and report as alternatives to the p-value: confidence intervals, Bayes factors, and magnitude-based inference. These are used in mostly the same situations, but all of them are more informative, especially in combination with the P value.
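As a small illustration of the first alternative, here is a sketch on simulated data (the sample size, mean, and seed are arbitrary choices of mine). A single t.test() call reports both the p-value and the confidence interval, and the interval also carries the size and precision of the estimate.

```r
set.seed(1)  # arbitrary seed so the run is repeatable
x <- rnorm(30, mean = 0.4, sd = 1)  # simulated sample, invented for illustration

tt <- t.test(x, mu = 0)  # one-sample t-test of H0: mean = 0

tt$p.value   # the p-value alone: how surprising the data are under H0
tt$estimate  # the point estimate of the mean
tt$conf.int  # 95% confidence interval: a range of plausible values for the mean

# The two agree on 'significance': p < 0.05 exactly when the 95% CI excludes 0
excludes_zero <- tt$conf.int[1] > 0 || tt$conf.int[2] < 0
(tt$p.value < 0.05) == excludes_zero
```

Bayes factors and magnitude-based inference need more setup, including priors or a smallest meaningful effect, so they are left out of this sketch.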

Lingering questions from the 2018 MLB season

Here are a few more comments and ongoing questions about Major League Baseball that I wanted to post but didn't fit anywhere else. Just a few more weeks until spring training!

Inside: Pitch count superstitions, base coach evaluation, WAR in blowouts, and anecdotes of SafeCo.

The Vestigial Reference Letter

The graduate school reference letter is a holdover from a time when academia was much more of a closed-off clique than the relatively open network of today. Back then, references served to keep the world of higher education reserved for insiders. Nowadays, these letters are vestigial and purposeless; they only serve to waste the time of applicants, referees, and admissions people.

Replication Report - Signal Detection Analysis

The following is a report on the reproduction of the statistical work in the paper “Insights into Criteria for Statistical Significance from Signal Detection Analysis” by Jessica K. Witt at Colorado State University.

The original paper was accepted for publication by the Journal of Meta-Psychology, a journal focused on methodology and reproductions of existing work. This report is my first attempt at establishing a standard template for future such reports according to the Psych Data Standards found here https://github.com/psych-ds/psych-DS

I read this: Smart Baseball

Smart Baseball, by Keith Law, exposes many common baseball statistics like batting average for the garbage they are. He does this in an approachable, narrative style that will appeal to long-time baseball fans. It's very light on statistics, but Law obviously knows his stuff; he just isn't interested in showing off the details.

Venn Diagram Stats Memes

Statistics is great but it doesn't have the best meme portfolio, at least not since Biostatistics Ryan Gosling.

Let's try to fix that.

More after the break.

Statistical Vocabulary Crossword Puzzles

I've expanded the statistical thesaurus again, as well as made some crossword puzzles from them, for vocabulary study.
