Friday 22 April 2016

nhlscrapr revisited

VanHAC, the Vancouver Hockey Analytics Conference, was April 9th, and I was presenting a tutorial on the nhlscrapr package for R.

This post is excerpts from the code I presented and gave out at the tutorial. The full tutorial expands on my previous 'package spotlight' post on nhlscrapr. This post only includes the bare bones: downloading the raw games, examining the rate of goals scored and shots fired throughout the game, and making a basic player summary.

Also included is a patch to nhlscrapr I wrote that fixes a couple of functions (full.game.database(), player.summary()) that were throwing some errors, and adds a function (aggregate.roster.by.name()) that aids in matching player summaries to the proper names.

You can load the nhlscrapr package and use the patch with the following code:

source("Patch for nhlscrapr.r")

After that, you can do things like define seasons after 2014-15 to extract. The following line builds a data frame of game IDs that extends one season beyond 2014-15.

fgd = full.game.database(extra.seasons = 1)

This lets you download, process, and compile the games in the 2015-16 season:

thisyear = "20152016"
game_ids = subset(fgd, season == thisyear)

dummy = download.games(games = game_ids, wait = 5)
compile.all.games(output.file="NHL play-by-play 2014-5.RData")

After extraction and compiling, we have two files

# .. the events log
temp = load("source-data\\nhlscrapr-20142015.RData")
ev_all = get(temp)

#.. and the player roster
temp = load("source-data\\nhlscrapr-core.RData")
roster = get(temp)
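The load() and get() pairing above works because load() returns the name of the object it restored, and get() then fetches the object by that name. Here is a minimal sketch of the idiom on a throwaway file. The object name toy_events and the temporary file are made up for illustration, and the idiom assumes the .RData file holds a single object.

```r
# Toy stand-in for a compiled events table (hypothetical names and values)
toy_events = data.frame(seconds = c(30, 95, 3599),
                        etype = c("SHOT", "GOAL", "GOAL"))
toy_file = tempfile(fileext = ".RData")
save(toy_events, file = toy_file)
rm(toy_events)

# load() restores the object under its saved name and returns that name
temp = load(toy_file)
print(temp)

# get() looks the object up by name, so it can be stored under any name we like
ev_toy = get(temp)
print(nrow(ev_toy))
```

This way the rest of the script does not need to know what the object was called when it was saved.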

Analysis Snippet 1: Goals and Shots per minute of play

# First define a minutes variable based on the existing variable 'seconds'
ev_all$minutes = floor((ev_all$seconds - 0.5) / 60)
ev_all$minutes = pmax(ev_all$minutes, 0)

# Isolate the database to situations that are 
# in regulation time and during the regular season
ev_reg =  subset(ev_all, period <= 3 & gcode < 30000 & gcode >= 20001)
ev_goals = subset(ev_reg, etype == "GOAL")
ev_shots = subset(ev_reg, etype == "SHOT")
goals_per_hr = as.numeric(table(ev_goals$minutes)) / 1230 * 60
shots_per_hr = as.numeric(table(ev_shots$minutes)) / 1230 * 60

This produces plots like these:

Note the big jump at the end of the game. Removing the empty-net goals removes the jump entirely.
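As a rough illustration of that comparison, the sketch below counts goals per minute with and without empty-net goals on made-up data. The empty_net column is hypothetical: it stands in for whatever flag you derive from the goalie columns in your own events table, and the numbers are invented.

```r
# Made-up goals with a hypothetical empty_net flag
toy_goals = data.frame(
  minutes   = c(12, 37, 58, 59, 59, 59),
  empty_net = c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE)
)

# Count goals in every minute 0..59, keeping minutes with zero goals
all_counts = table(factor(toy_goals$minutes, levels = 0:59))

# Same count after dropping the empty-net goals
no_en = subset(toy_goals, !empty_net)
no_en_counts = table(factor(no_en$minutes, levels = 0:59))

print(all_counts[["59"]])    # goals in the final minute, all goals
print(no_en_counts[["59"]])  # goals in the final minute, empty-net goals removed
```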

Few shots at the beginning of each period, and a downward trend. Could this be warm-up and fatigue?
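To make the rate calculation in Snippet 1 concrete, here is a small sketch on made-up data. Dividing a minute's event count by the number of games gives events per game-minute, and multiplying by 60 turns that into an hourly rate. The 1230 in the snippet is the number of regular-season games in a 30-team, 82-game season (30 * 82 / 2). The toy data and n_games = 2 are invented, and the factor() call is my own addition: table() on its own silently drops minutes with no events, which would misalign the results.

```r
# Made-up event times (in seconds) pooled over 2 hypothetical games
n_games = 2
toy = data.frame(
  seconds = c(0, 30, 60, 61, 119.5, 120, 3540, 3599, 3600),
  etype   = c("SHOT", "GOAL", "SHOT", "SHOT", "GOAL", "GOAL", "SHOT", "GOAL", "GOAL")
)

# Same binning as the snippet: second 1-60 is minute 0, 61-120 is minute 1, ...
toy$minutes = floor((toy$seconds - 0.5) / 60)
toy$minutes = pmax(toy$minutes, 0)   # an event at second 0 would otherwise land in minute -1

toy_goals = subset(toy, etype == "GOAL")

# Fix the levels so every minute 0..59 appears, even with zero goals
goal_counts = table(factor(toy_goals$minutes, levels = 0:59))

# Goals per minute of play, averaged over games, scaled to an hourly rate
goals_per_hr = as.numeric(goal_counts) / n_games * 60
print(goals_per_hr[c(1, 2, 60)])   # minutes 0, 1 and 59
```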

Analysis Snippet 2: Sedin Summary

If we just want the basic event counts, we can use the roster to find the player IDs for the Sedin twins and see the number of goals, shots, hits, etc. they had in the 2015-16 season.

sedins = subset(roster,last=="SEDIN")$player.id 
ev_sedin = subset(ev_all, ev.player.1 %in% sedins) 

table(ev_sedin$ev.player.1, ev_sedin$etype)
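If you do not have the compiled data handy, the same filter-then-tabulate pattern can be tried on a toy roster and event log. Everything below is invented for illustration; only the column names (player.id, last, ev.player.1, etype) mirror the ones used above.

```r
# Toy roster and event log (made-up players and events)
toy_roster = data.frame(
  player.id = c(101, 102, 103),
  last = c("TWIN", "TWIN", "OTHER"),
  stringsAsFactors = FALSE
)
toy_events = data.frame(
  ev.player.1 = c(101, 102, 101, 103, 102, 101),
  etype = c("SHOT", "SHOT", "GOAL", "HIT", "SHOT", "SHOT"),
  stringsAsFactors = FALSE
)

# Look up the IDs by last name, then keep only events credited to those IDs
twins = subset(toy_roster, last == "TWIN")$player.id
ev_twins = subset(toy_events, ev.player.1 %in% twins)

# Rows are player IDs, columns are event types
counts = table(ev_twins$ev.player.1, ev_twins$etype)
print(counts)
```

Note that the table only has columns for event types that actually occur in the filtered data, so the HIT by player 103 does not show up at all.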

We can also get player summaries for more complex events using the player.summary() function

roster_name = aggregate.roster.by.name(roster)
ps = player.summary(ev_all, roster_name) 

The output of player.summary() is an array of 5 tables.
The first table is the person who did the event (e.g. scored the goal, got the penalty, made the shot, missed the shot, or blocked the shot).
player_summary = as.data.frame(ps[,,1])

The second table is the second person in the event, if relevant (e.g. 1st assist, victim of the penalty (?), 2nd block (?)).
The third table is the third person in the event, if relevant (only 2nd assists).
player_summary$ASSIST = ps[,3,2] + ps[,3,3] # Third column is the GOAL event

The fourth table is anyone who was on the ice when the event happened and whose team was ev.team.
The fifth table is anyone who was on the ice when the event happened and whose opponent was ev.team.
ev.team refers to the team that scored, took the shot, won the faceoff, or received the penalty.
player_summary$PLUSMINUS = ps[,3,4] - ps[,3,5]
player_summary$PLUSMINUS_SHOTS = ps[,2,4] - ps[,2,5]
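The indexing above is easier to follow on a tiny array with named dimensions. This is only a sketch of the player x event x role layout described here: the player names, the role labels, and the counts are all invented, and the event order is chosen just so that SHOT is column 2 and GOAL is column 3, as in the code above.

```r
# Toy stand-in for the player.summary() output (all names and counts are made up)
players = c("PLAYER A", "PLAYER B")
events  = c("PENL", "SHOT", "GOAL")
roles   = c("player.1", "player.2", "player.3", "on.ice.for", "on.ice.against")
toy_ps  = array(0, dim = c(2, 3, 5), dimnames = list(players, events, roles))

toy_ps["PLAYER A", "GOAL", "player.1"]       = 2   # goals scored
toy_ps["PLAYER A", "GOAL", "player.2"]       = 3   # 1st assists
toy_ps["PLAYER A", "GOAL", "player.3"]       = 1   # 2nd assists
toy_ps["PLAYER A", "GOAL", "on.ice.for"]     = 10
toy_ps["PLAYER A", "GOAL", "on.ice.against"] = 4
toy_ps["PLAYER B", "GOAL", "on.ice.for"]     = 3
toy_ps["PLAYER B", "GOAL", "on.ice.against"] = 5

# Table 1 becomes the base data frame, one row per player
toy_summary = as.data.frame(toy_ps[, , 1])

# Assists: GOAL column (3) of tables 2 and 3
toy_summary$ASSIST = toy_ps[, 3, 2] + toy_ps[, 3, 3]

# Plus-minus: GOAL column of table 4 minus GOAL column of table 5
toy_summary$PLUSMINUS = toy_ps[, 3, 4] - toy_ps[, 3, 5]
print(toy_summary)
```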

Finally, we can use information from the roster to fill in the name information
roster_name = subset(roster_name, firstlast %in% rownames(player_summary))
name_idx = match(row.names(player_summary), roster_name$firstlast)
player_summary$firstlast = roster_name$firstlast[name_idx]
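The match() call is what keeps the two tables aligned when their rows are in different orders: for each row name in the summary, it returns the position of that name in the roster. A minimal sketch with invented names follows; the last column here is hypothetical and only shows how a second roster field could be carried across with the same index.

```r
# Toy summary whose row order differs from the roster (made-up names)
toy_summary = data.frame(GOAL = c(5, 2, 7),
                         row.names = c("C THREE", "A ONE", "B TWO"))
toy_names = data.frame(
  firstlast = c("A ONE", "B TWO", "C THREE", "D FOUR"),
  last      = c("ONE", "TWO", "THREE", "FOUR"),
  stringsAsFactors = FALSE
)

# Position of each summary row in the roster
idx = match(row.names(toy_summary), toy_names$firstlast)
print(idx)

# Reuse the index to pull any roster column into the summary, in the right order
toy_summary$last = toy_names$last[idx]
print(toy_summary)
```

Any summary row with no match in the roster would get NA in idx, and so NA in the copied columns.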

And we can look at the Sedins again for comparison, matching on the firstlast column we just filled in
subset(player_summary, grepl("SEDIN", firstlast))

Apologies to A.C. Thomas, the author of nhlscrapr, if I'm stepping on your toes with this patch.

1 comment:

  1. I keep getting thrown an error on compile.all.games. 1:max(roster.master$player.id) : result would be too long a vector.
    Warning Message: no non-missing arguments to max; returning -Inf