
Friday 29 December 2023

Rating Systems Explained

Teams in league sports often play balanced schedules. For example, in each season of the English Premier League, each football/soccer team plays each other team in the league exactly twice - once at their home stadium, and once at the opposing stadium. That way, at the end of the season, you can merely look at the record of wins, draws, and losses to determine which teams have done the best, second best, and so on. This works because everyone has had the same opposition, so wins against that same opposition are comparable.

But what would happen if different teams played different opponents, or even different numbers of matches? This is exactly the situation in many eSports, as well as in chess: individual players may play different numbers of games against completely different opponents. To compare competitors in such a situation, we can use a rating system.


A typical rating system assigns a score to each competitor - either an individual or a team, depending on the game being played. Higher scores indicate better long-term performance by that competitor. When two competitors face each other, the loser loses points and the winner gains them, but the amount won or lost depends on how 'surprising' the result is.

For example, if a strong competitor (pre-match rating 1400) beats a weak competitor (pre-match rating 1000), perhaps only 2 points of rating would be exchanged (updated ratings of 1402 and 998, respectively), because this result was the expected one based on previous performance. However, if the weak competitor wins, many more points would be exchanged - say 16, updating the ratings to 1384 for the strong competitor and 1016 for the weak competitor. A draw can be considered half a win for each competitor, so a draw between unequal opponents also results in an exchange of points; in our example the strong competitor would lose 7 points (the average of gaining 2 and losing 16) to the weak competitor, making the new ratings 1393 and 1007, respectively.
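Under standard Elo, all three of these exchanges come from a single update rule: new rating = old rating + K × (actual score − expected score), where the expected score follows a logistic curve in the rating difference. A minimal Python sketch (the value K = 18 is an assumption, chosen because it roughly reproduces the 2- and 16-point exchanges above):

```python
def elo_update(r_a, r_b, score_a, k=18):
    """One Elo update; score_a is 1 for an A win, 0 for a loss, 0.5 for a draw."""
    # Expected score for A: logistic in the rating difference, base 10, scale 400.
    e_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    # Zero-sum exchange: whatever A gains, B loses.
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta

print([round(x) for x in elo_update(1400, 1000, 1)])    # strong wins: [1402, 998]
print([round(x) for x in elo_update(1400, 1000, 0)])    # upset: [1384, 1016]
print([round(x) for x in elo_update(1400, 1000, 0.5)])  # draw: [1393, 1007]
```

Note that the three results above match the worked example exactly after rounding, which is why K = 18 was picked for the sketch; real chess federations use K values like 10, 20, or 40 depending on the player.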

In the long run, we would expect the strong competitor to win much more often, but to gain only a few points with each win, while the weak competitor's rare wins bring much larger gains. As such, the ratings should stabilize near their true relative values after many matches.
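This stabilization can be checked with a small simulation: start two players at the same rating, assume the stronger one truly wins 76% of games (which corresponds to roughly a 200-point Elo edge), and watch the gap settle. The win probability, K-factor, and seed here are all illustrative assumptions:

```python
import random

random.seed(1)
K = 18
strong, weak = 1200.0, 1200.0  # both start at the same rating
gaps = []
for game in range(5000):
    # Assumed true win probability of 0.76 for the strong player.
    s = 1.0 if random.random() < 0.76 else 0.0
    # Expected score for the strong player, standard Elo logistic curve.
    e = 1 / (1 + 10 ** ((weak - strong) / 400))
    strong += K * (s - e)
    weak -= K * (s - e)  # zero-sum exchange
    if game >= 3000:     # average over the tail, after burn-in
        gaps.append(strong - weak)

avg_gap = sum(gaps) / len(gaps)
print(round(avg_gap))  # settles near 200
```

Individual ratings keep fluctuating game to game, but the long-run average gap lands close to the 200 points implied by a 76% win rate.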

The most famous of these rating systems is the Elo system, developed by Arpad Elo for chess. Chess players across the world have Elo ratings, which are used in part to determine tournament eligibility (e.g., a tournament may have an "Under 1600" division and an open division) and to award titles like grandmaster.

Elo has an implementation in R through the "elo" package on CRAN. There is documentation and some vignettes, including an introduction, at https://cran.r-project.org/web/packages/elo/index.html . The "Calculating Running Elo Updates" vignette has a classic example of two-team matches as well as an example of four-team competition.

Glicko (details at: http://www.glicko.net/glicko/glicko.pdf ), developed by Mark Glickman, assigns competitors both a rating and an uncertainty measure. As a competitor plays more matches, their uncertainty measure shrinks because we have more information about their true rating. As time passes without play, that information becomes less relevant, so the uncertainty measure grows again. Ratings change more with each match when uncertainty is high.
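The growth of that uncertainty (the "rating deviation", or RD) between rating periods follows a simple rule in the paper linked above: RD' = min(√(RD² + c²·t), 350), where t is the number of periods since the competitor last played and c is a constant controlling how quickly old results go stale. A sketch (the value of c here is an illustrative assumption, not taken from the paper):

```python
import math

def inflate_rd(rd, t, c=35.0, rd_max=350.0):
    """Grow a Glicko rating deviation after t idle rating periods.

    c (an assumed value here) controls how quickly old results lose
    relevance; RD is capped at rd_max, the deviation of a brand-new player.
    """
    return min(math.sqrt(rd ** 2 + c ** 2 * t), rd_max)

print(round(inflate_rd(50, 0)))    # active player keeps a tight RD: 50
print(round(inflate_rd(50, 10)))   # uncertainty grows with inactivity: 121
print(round(inflate_rd(50, 500)))  # ...but never past the cap: 350
```

The cap matters: a long-absent competitor is treated like a newcomer, not as infinitely uncertain.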

Glicko, Elo, and other rating systems are implemented in the 'PlayerRatings' package on CRAN, with details here https://cran.r-project.org/web/packages/PlayerRatings/ .

Other rating systems incorporate other details. RAPTOR (details at: https://fivethirtyeight.com/features/introducing-raptor-our-new-metric-for-the-modern-nba/ ), developed by Nate Silver of FiveThirtyEight, uses player-tracking data in NBA basketball to estimate the offensive and defensive ability of individual players; RAPTOR then aggregates the individual player estimates into a rating for the team.


Glicko 2.0 incorporates winning and losing streaks, taking them as predictive indications that a competitor is actually better or worse than their individual wins and losses suggest.

Microsoft Trueskill 1 and 2 are particularly interesting because they manage competitions between more than two competitors, such as a multiplayer video game. Trueskill 1 uses only the ranking of players; to oversimplify, a player wins against everyone they finish ahead of in a multiplayer match and loses against everyone they finish behind. Trueskill 2 uses more detailed player performance, such as in-game score, to determine adjustments to ratings. (Details of Trueskill 1 are at: https://www.microsoft.com/en-us/research/wp-content/uploads/2007/01/NIPS2006_0688.pdf , and of Trueskill 2 at: https://www.microsoft.com/en-us/research/uploads/prod/2018/03/trueskill2.pdf )
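The oversimplification above - treating one multiplayer result as a set of pairwise wins and losses - can be sketched directly. This is a hypothetical helper for illustration, not the actual TrueSkill factor-graph inference:

```python
from itertools import combinations

def pairwise_results(finish_order):
    """Decompose a multiplayer finish order (best first) into pairwise outcomes.

    Each player 'beats' everyone who finished below them; a two-player
    rating update could then be applied to each implied pairing.
    """
    return [(winner, loser) for winner, loser in combinations(finish_order, 2)]

print(pairwise_results(["ana", "bob", "cam"]))
# → [('ana', 'bob'), ('ana', 'cam'), ('bob', 'cam')]
```

So one four-player match implies six pairwise results, which is why multiplayer matches can move ratings faster than head-to-head ones.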

The 'TrueSkillThroughTime' package on CRAN implements the TrueSkill rating system and expands upon it (see https://github.com/glandfried/TrueSkillThroughTime.R for details and an example). The simpler package 'trueskill' has been removed from CRAN; it was last updated in 2013.

These rating systems are also used for predictions. In the FIDE implementation of Elo, a rating difference of 400 points suggests a 10-to-1 advantage, and a 200-point difference suggests a square root of 10, or 3.16-to-1, advantage. That is, a player rated 1200 is expected to get 76% of the wins against a 1000-rated player, and 24% of the wins against a 1400-rated player. In terms of American odds, a 200-point difference implies fair odds of -316 for the higher-rated player and +316 for the lower-rated player, ignoring the chance of a draw. Likewise, in the classical implementation of Glicko, a difference of 400 points suggests a 10-to-1 advantage.
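These numbers follow from the Elo expected-score formula, p = 1 / (1 + 10^(−d/400)) for a rating difference d, which converts directly to American odds. A quick check in Python (the helper names are my own):

```python
def elo_win_prob(diff):
    """Win probability implied by a FIDE-style Elo rating difference."""
    return 1 / (1 + 10 ** (-diff / 400))

def american_odds(p):
    """Fair American odds for win probability p (ignoring draws)."""
    return round(-100 * p / (1 - p)) if p > 0.5 else round(100 * (1 - p) / p)

p = elo_win_prob(200)
print(round(p * 100))                  # → 76 (percent), as stated above
print(american_odds(p))                # → -316 for the favourite
print(round(elo_win_prob(400) * 100))  # → 91, i.e. a 10-to-1 advantage
```

A 400-point gap gives p = 10/11 ≈ 0.91, which is exactly the 10-to-1 advantage quoted for both FIDE Elo and classical Glicko.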

The predictions from Trueskill 2 (and Trueskill 1 in games released before 2018) are used for matchmaking. When a player enters the queue for a match, they are paired with other players such that the estimated probability of that player substantially overperforming or underperforming their opponents and teammates is low.
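A crude version of this idea pairs each queued player with the opponent whose predicted result is closest to a coin flip. This toy sketch uses Elo-style win probabilities and invented names; it is not Microsoft's actual matchmaker:

```python
def best_opponent(player_rating, queue_ratings):
    """Pick the queued opponent whose implied win probability is nearest 50%."""
    def win_prob(diff):
        # Elo-style logistic curve in the rating difference.
        return 1 / (1 + 10 ** (-diff / 400))
    return min(queue_ratings, key=lambda r: abs(win_prob(player_rating - r) - 0.5))

print(best_opponent(1500, [1100, 1480, 1900]))  # → 1480, the closest match
```

Real matchmakers also weigh queue time and team composition, but the core trade-off is the same: the closer the predicted probability is to 50%, the fairer (and less predictable) the match.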

In part 2 of this article, found in my gambling course to be released soon on Udemy, we will apply Glicko 1 to NHL ice hockey teams, and a system like Trueskill to horse races at Woodbine racetrack in Toronto.

See also: https://towardsdatascience.com/developing-a-generalized-elo-rating-system-for-multiplayer-games-b9b495e87802 (Developing a generalized Elo Rating system for 3+ player games)

and https://uwaterloo.ca/computational-mathematics/sites/ca.computational-mathematics/files/uploads/files/justin_dastous_research_paper.pdf (Trueskill with in-game events)

and https://cs.stanford.edu/people/paulliu//files/www-2021-elor.pdf (Elo for massive multiplayer)
