Saturday, 9 November 2019

Basketball Data Science with Applications in R - An Advance Review

My overall impression of "Basketball Data Science" is that it's exactly the sort of book I would recommend to an instructor or able student of statistics in sport. Most of my criticisms are because it's not the book I imagine I would have written, not because the book isn't good.

Transparency statement:

This is a review of an advance beta copy of Basketball Data Science by Paola Zuccolotto and Marica Manisera that I received from CRC Press as one of 9 people asked to determine the market viability of this book. I was paid to review this book, but I was not paid to publish that review. I've done my best to update this review to reflect the changes made to the book since then.

General impression:

On a spectrum from "all methodology", like Casella and Berger's famous textbook, to "all results" like a typical sports almanac, I would put "Basketball Data Science" between Casella-Berger and Jim Albert's "Visualizing Baseball". "Basketball Data Science" is intended to show how the data analysis is done, rather than revealing many sports-related insights directly. Also, this is a data science book, not a classical statistics book, and it focuses much more on description than on testing hypotheses. By my count, there are only 4 p-values used in all 257 pages.

This is mostly, but not entirely, the right choice. In one case, a test would have helped: the authors claim that the Golden State Warriors tend to concentrate their shots in the first half of each quarter and the first half of each game, which amounts to claiming that the team takes 52% and 54% of its shots, respectively, in 50% of the game time. Without a test, it's hard to tell whether a difference that small is anything more than noise.
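To make that concern concrete, here is a minimal sketch of the kind of check the claim seems to call for: is a 52% share of shots distinguishable from an even 50% split? The shot counts are hypothetical, not taken from the book, and the normal approximation to the binomial is my choice, not the authors' method.

```python
import math

def two_sided_p_value(share, n_shots, null_share=0.5):
    """Normal-approximation p-value for an observed share of n_shots."""
    standard_error = math.sqrt(null_share * (1 - null_share) / n_shots)
    z = (share - null_share) / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical shot counts; the book's actual totals are not reproduced here.
for n_shots in (500, 2000, 7000):
    print(n_shots, round(two_sided_p_value(0.52, n_shots), 4))
```

Under these assumptions, a 52% share only clears the usual 5% threshold once the sample runs to a few thousand shots, so how seriously to take the claim depends on how many shots are behind it.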

Each chapter goes through a different type of analysis, including basic player statistics, passing tendencies, pace-of-play comparisons, and spatial analysis. One chapter looks at the inequality of points scored, with tools like the Lorenz curve and the Gini index which can be readily applied to other sports for things like shots, passes, and time in play.
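For readers who haven't met these tools, here is a rough sketch of how a Lorenz curve and a Gini index can be computed from scratch. This is plain Python rather than the book's R package, the function names are my own, the points totals are made up, and the book's package may scale the index differently.

```python
def lorenz_curve(values):
    """Cumulative share of the total held by the lowest-ranked k items."""
    ordered = sorted(values)
    total = sum(ordered)
    shares, running = [0.0], 0.0
    for value in ordered:
        running += value
        shares.append(running / total)
    return shares

def gini_index(values):
    """Gini index: 0 for perfect equality, at most (n - 1) / n for n items."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    # Weight each value by its rank after sorting in ascending order.
    weighted = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

# Made-up points totals for a five-player lineup, not real NBA data.
points = [30, 22, 12, 8, 4]
print(lorenz_curve(points))
print(gini_index(points))
```

Nothing here is specific to points: swapping in shots, passes, or minutes per player gives the same kind of summary.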

The book reads like an extended manual for the authors' R package, BasketballAnalyzeR, which makes for cohesive, digestible study material that minimizes dependencies on external sources and software. However, this isn't the only package available for examining NBA data. The package "py_ball", for example, is a Python API wrapper for both the NBA and WNBA websites. Sticking to a single package isn't an oversight, but a deliberate choice by the authors for the sake of accessibility.

On that note, I was disappointed to find that the book only included analysis and information related to the NBA. Along with the WNBA, there are leagues outside of North America that could really use some more statistical attention as well. (The authors addressed this by saying that they had tried to get data from the WNBA as well, but that it wasn't available. Yes, this is hypocritical of me to complain about given my own work.)


The book uses radial plots to demonstrate disparate statistical features about a player. I like them. The big issue with radial / umbrella plots is that they're sensitive to the order in which you arrange statistics, and I think the order chosen here is well thought out. Disagreements and hate mail welcome.
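The ordering issue is easy to show with a small sketch. Assuming equally spaced axes and made-up, already scaled statistics, the area of the plotted polygon changes when the same numbers are arranged in a different order. The function name is mine, not something from the book or its package.

```python
import math

def radar_area(values):
    """Area of the polygon drawn on equally spaced radial axes."""
    n = len(values)
    angle = 2 * math.pi / n
    # Neighbouring axes a and b span a triangle of area 0.5 * a * b * sin(angle).
    return 0.5 * math.sin(angle) * sum(
        values[i] * values[(i + 1) % n] for i in range(n)
    )

# Made-up scaled statistics for one player, arranged two different ways.
print(radar_area([1, 2, 3, 4]))
print(radar_area([1, 3, 2, 4]))
```

The two calls use the same four values but enclose different areas, which is why the choice of order deserves some thought.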

Early in the book, the authors mention that range of shots is not a reliable measure in the example case because of a single outlier player, Steven Adams. There's a teaching opportunity here for any instructors willing to follow the footnotes and references: What do you do in the face of unusual or unexpected data?

For the inequality section, a possible research question for students could address how the inequality of passes compares with the inequality of time spent on the court per player, or of the time shared between two players on the court. Do any of your conclusions change if you use a different measure of inequality like the Shannon diversity index?
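As a starting point for that exercise, here is a minimal sketch of the Shannon diversity index. The pass counts are made up, and the function is my own rather than anything from the book's package.

```python
import math

def shannon_index(counts):
    """Shannon diversity index H = -sum(p * ln(p)) over non-zero shares."""
    total = sum(counts)
    shares = [count / total for count in counts if count > 0]
    return -sum(share * math.log(share) for share in shares)

# Made-up pass counts per player, not real data.
even = [20, 20, 20, 20, 20]
concentrated = [60, 20, 10, 5, 5]
print(shannon_index(even))
print(shannon_index(concentrated))
```

Note that the index runs in the opposite direction to the Gini index: it is largest, at ln(n) for n players, when everything is shared evenly, and it shrinks as the distribution concentrates.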

It was surprising to learn that almost every player falls into one of three positions (guard, forward, and center), especially given how fluid the movement seems and how much the paradigms of basketball have changed over time. I specifically asked about this in the first version of this review, and the authors added a footnote about hybrid positions "such as the point forward (a hybrid PG/SF), the swingman (a hybrid SG/SF), the big (a hybrid C/PF) and the stretch four (a PF with the shooting pattern of a typical SG)", but all of these were apparently recognized in the 1950s and none since. I'd like to add one more special case to that list: Bob McAdoo, whose case Jon Bois explains at the 13:15 mark of The Bob Emergency: Part 2 https://youtu.be/dcGamPqUIxI?t=795 better than I ever could.

While fouls are mentioned a few times, and are readily available in the software package, there isn't much in the way of analysis of them. What happens when a star player has a lot of fouls? There's another student project.

To reiterate, I learned a lot from reading this, as a statistician and a relative outsider to basketball. It made the recent playoff games with the Toronto Raptors and the Golden State Warriors all the more enjoyable.
