Tuesday 17 May 2016

Course Notes on the PitchRx package

These are the notes I recently delivered as a guest lecturer for Simon Fraser University's course on Sports Analytics. It's a course for people with some experience with R, but not necessarily experts. As such, I made these notes with beginners in mind.

If you want to see what pitchRx can really do, I recommend the links below. However, if you want to get started and you don't have much familiarity with R, and possibly none with SQL, the following notes are for you.


The PitchRx package is an R package designed to use a Major League Baseball dataset called pitchFx. As with nhlscrapr, there are other means of accessing this dataset, but my preference is usually towards R integration.

It gives you detailed pitch-by-pitch information about every MLB game, including the speed and location of the ball as it crossed home plate.

Getting started with pitchRx is pretty quick:
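
A minimal sketch of the setup step. It assumes pitchRx can be installed from your CRAN mirror (or its archive), which may no longer be true on every system; the variable name has_pitchRx is just for illustration.

```r
# One-time install; commented out so the script can be re-run safely.
# Assumption: pitchRx is available from your CRAN mirror.
# install.packages("pitchRx")

# Load the package only if it is installed, and say so if it is not.
has_pitchRx = requireNamespace("pitchRx", quietly = TRUE)
if (has_pitchRx) {
  library(pitchRx)
} else {
  message("pitchRx is not installed yet; run install.packages('pitchRx') first.")
}
```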
The scrape() function in pitchRx will allow you to scrape data from every game that happened during the range of days given. Days are in the YYYY-MM-DD format, which is used because...

1) It's the same format that SQL uses.
2) It's the ISO standard (ISO 8601).

dat = scrape(start = "2013-06-01", end = "2013-06-01")
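
If you want more than one day, it can help to build the date strings rather than typing them. Here is a small sketch in base R. The variable names are only for illustration, and the scrape() call is left commented out because it needs pitchRx and an internet connection.

```r
# Build a one-week range in YYYY-MM-DD format using base R dates.
first_day = as.Date("2013-06-01")
last_day = first_day + 6   # adding a number to a Date adds that many days

# format() turns the dates back into text in the format scrape() expects.
start_str = format(first_day, "%Y-%m-%d")
end_str = format(last_day, "%Y-%m-%d")

print(c(start_str, end_str))

# dat_week = scrape(start = start_str, end = end_str)
```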

The dataset 'dat' that comes out of scrape() is a collection of five tables.

 "atbat"  "action" "pitch"  "po"     "runner"

'atbat': Describes the outcome of each at-bat. One row = one batter.

'action': Other events not related to at-bats, such as pitching changes, coaching visits to the mound, and managers getting ejected from the game.

'pitch' : Pitch-by-pitch description. One row = one pitch. Has lots of physics variables relating to each pitch, but lacks the text descriptions that accompany at-bats.

'po' : Pickoff attempt descriptions.

'runner': Description of where each runner ended up. Most of the rows correspond to at-bats. The other rows represent running events like advancing on someone else's hit, or being forced out.

We will focus on the pitch-by-pitch table.

The list of variables is... intimidating.

Thankfully, a lot of these are the same across all the tables.

des, des_es: The text description of the pitch in English or Spanish, respectively. Examples include "Ball", "Foul", "Strike", and "In play, run(s)".

num: The number of the at-bat for this game. Also used in the at-bat data frame.

count: The ball-strike count before the pitch occurred.

start_speed, end_speed: The speed of the ball, in miles per hour, when it leaves the pitcher's hand and when it reaches home plate, respectively.

px: The horizontal position at which the ball crosses the home-plate plane. Measured in feet left or right of the center of home plate, from the perspective of the catcher.

pz: The vertical position of the ball crossing the home-plate plane. Measured in feet above the ground.

nasty: The 'nasty factor', which is a function of physical variables that is supposed to describe how difficult a pitch is to hit.

spin_rate: The (mean?) rate at which the baseball was spinning, in revolutions per minute (RPM). Yes, some pitchers really do spin the ball at 2700 RPM!

zone (unconfirmed): The portion of the strike zone (or outside it) in which a pitch crossed the plate.

Example analysis: Pitch count

One big issue in baseball is pitch count. As a pitcher, especially a starter, throws many pitches, they tire and their performance supposedly gets worse.

Is this true? Let's plot some variables against pitch count.

First, let's isolate one team of one game.

atbat_1game = subset(dat$atbat, inning_side == "top" & gameday_link == "gid_2013_06_01_arimlb_chnmlb_1")

pitch_1game = subset(dat$pitch, inning_side == "top" & gameday_link == "gid_2013_06_01_arimlb_chnmlb_1")

Next, we have to identify the pitcher that throws each pitch. We have to get this information from the at-bat table.

pitcher = rep(NA, nrow(pitch_1game))
for(k in seq_len(nrow(pitch_1game))) {
    # at-bat number of the k-th pitch
    thisnum = pitch_1game$num[k]
    # look up who pitched in that at-bat
    pitcher[k] = atbat_1game$pitcher[which(atbat_1game$num == thisnum)]
}
pitch_1game$pitcher = pitcher
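
The same lookup can be done without a loop using match(). This is only a sketch on toy stand-ins for the two tables: atbat_toy and pitch_toy are made-up data frames that borrow the real column names, and the pitcher IDs are invented.

```r
# Toy stand-ins for the real tables (made-up values, same column names).
atbat_toy = data.frame(num = c(1, 2, 3),
                       pitcher = c("450001", "450001", "450002"),
                       stringsAsFactors = FALSE)
pitch_toy = data.frame(num = c(1, 1, 2, 3, 3, 3))

# match() finds, for each pitch, the row of the at-bat with the same num.
row_in_atbat = match(pitch_toy$num, atbat_toy$num)
pitch_toy$pitcher = atbat_toy$pitcher[row_in_atbat]

print(pitch_toy)
```

With the real data, pitch_1game and atbat_1game would take the place of the toy tables.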

Now that we know the pitcher that threw each pitch, we can find the pitch count. This R script first ensures that event_num is treated like a number and not a string. This is important because we will use event_num to put the game's pitches in chronological order.

pitch_1game$event_num = as.numeric(pitch_1game$event_num)
pitch_1game = pitch_1game[order(pitch_1game$event_num),]
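
To see why the conversion to a number matters, here is a small sketch with made-up event numbers stored as text:

```r
# event_num as it might arrive from the scrape: stored as text (made-up values).
event_chr = c("3", "10", "21", "4")

# Sorting text compares character by character, so "10" lands before "3".
print(sort(event_chr))

# Converting to numbers first gives the chronological order.
event_as_number = as.numeric(event_chr)
print(sort(event_as_number))
```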

This R script makes a new variable for pitch count. For a given pitcher, it marks the pitches as 1, 2, ... up to the number of pitches thrown. It does this separately for each pitcher, and when it's done, it puts that new variable into the 1-game data frame.

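pitchcount = rep(NA, nrow(pitch_1game))

# Use the pitcher column of the sorted data frame, so the counts follow event_num order
for(thispitcher in unique(pitch_1game$pitcher)) {
    idx = which(pitch_1game$pitcher == thispitcher)
    pitchcount[idx] = 1:length(idx)
}
pitch_1game$pitchcount = pitchcount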
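
For the curious, base R's ave() can do the same per-pitcher counting as the loop above in one line. This is a sketch on a made-up vector of pitcher labels, assumed to be in chronological order already.

```r
# Made-up pitcher labels, one per pitch, already in chronological order.
pitcher_toy = c("A", "A", "B", "A", "B", "B")

# ave() applies seq_along() within each pitcher's group and puts the
# results back in the original positions.
pitchcount_toy = ave(seq_along(pitcher_toy), pitcher_toy, FUN = seq_along)

print(pitchcount_toy)
```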

plot(pitch_1game$end_speed ~ pitch_1game$pitchcount)

plot(pitch_1game$nasty ~ pitch_1game$pitchcount)

plot(pitch_1game$spin_rate ~ pitch_1game$pitchcount)
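
Because the pitch count restarts for every pitcher, these plots mix several pitchers together. One way to tell them apart is to colour the points by pitcher. The sketch below uses a made-up toy data frame (the speeds are invented, not real measurements), and the names toy and pitcher_col are only for illustration.

```r
# Toy data (made-up numbers): two pitchers, three pitches each.
toy = data.frame(pitcher = c("A", "A", "A", "B", "B", "B"),
                 pitchcount = c(1, 2, 3, 1, 2, 3),
                 end_speed = c(84.1, 83.5, 82.9, 79.0, 79.4, 78.2),
                 stringsAsFactors = FALSE)

# One integer colour code per pitcher: 1 for the first pitcher, 2 for the next.
pitcher_col = as.integer(factor(toy$pitcher))

pdf(NULL)   # draw to a null device; remove this line and dev.off() to see the plot
plot(end_speed ~ pitchcount, data = toy, col = pitcher_col, pch = 19)
legend("topright", legend = levels(factor(toy$pitcher)), col = 1:2, pch = 19)
dev.off()
```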

These tables are linked by some identifying variables.

gameday_link, example:  gid_2013_06_01_wasmlb_atlmlb_1

This is found in all five tables. It identifies the game as...

...happening on 2013-06-01,
...with Washington as the visiting team,
...and Atlanta as the home team,
...and was the first game between these teams that day

(In the case of two games in a day, the gameday link will end in _2 instead of _1 )
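
If you need those pieces as separate variables, the link can be split on the underscores. This is a sketch that assumes every link follows the pattern of the example above, including the "mlb" suffix on the team codes; the variable names are mine.

```r
link = "gid_2013_06_01_wasmlb_atlmlb_1"

# Split on "_" to get: "gid" "2013" "06" "01" "wasmlb" "atlmlb" "1"
parts = strsplit(link, "_")[[1]]

game_date = as.Date(paste(parts[2:4], collapse = "-"))  # year, month, day
away_team = sub("mlb$", "", parts[5])                   # visiting team code
home_team = sub("mlb$", "", parts[6])                   # home team code
game_number = as.integer(parts[7])                      # 1, or 2 for a doubleheader

print(list(game_date, away_team, home_team, game_number))
```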


Every event in a game has a number, event_num, that gives its place in the chronological order of events. The first recorded pitch has an event_num of 3.

After that, every pitch, pickoff attempt, running event, and entry in the 'action' table is given its own event_num.

Since pitchRx is based in SQL, the order of the rows that get scraped isn't guaranteed. The event_num variable is very useful if row-order matters to you.

Remember to save your work!

The data you scrape from pitchFx is NOT automatically saved to a file, the way data from nhlscrapr is.

It's probably worth the extra effort to save the tables as separate .csv files.

write.csv(dat$pitch, "Pitch Data 2013-06-01.csv")
write.csv(dat$atbat, "At Bat Data 2013-06-01.csv")
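
To save every table in one go, you can loop over the names of the list. The sketch below runs on a made-up toy list (dat_toy) and writes to a temporary folder; with real data you would use dat and a folder of your own.

```r
# Toy stand-in for the scraped list (made-up rows; the real tables are much wider).
dat_toy = list(atbat = data.frame(num = 1:2),
               pitch = data.frame(num = c(1, 1, 2), event_num = c(3, 4, 7)))

out_dir = tempdir()   # swap in your own folder

# Write one .csv file per table, named after the table.
for (tab in names(dat_toy)) {
  out_file = file.path(out_dir, paste0(tab, " 2013-06-01.csv"))
  write.csv(dat_toy[[tab]], out_file, row.names = FALSE)
}

# Read one file back to check that it was saved.
pitch_back = read.csv(file.path(out_dir, "pitch 2013-06-01.csv"))
print(pitch_back)
```

Setting row.names = FALSE keeps R from adding an extra column of row numbers to each file.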
