Thursday 22 January 2015

R Package Spotlight - stringr

One of my ongoing projects is to maintain a database of international cricket games, which I do by scraping the text from ESPN Cricinfo. For this kind of work, the R package 'stringr' is essential.

One task involved in this project is to find the names of all the players and their batting order in each Twenty20 International and One-Day International matchup. Here is a step-by-step of how to use stringr to extract the names. If you want to try it yourself, the file Game_Summary.txt is literally a Select All + Copy/Paste from the following link http://www.espncricinfo.com/ci/engine/match/287876.html .

First we load the stringr package and the text data. Make sure to set the working directory with setwd() first.

library(stringr)
intext = readLines("Game_Summary.txt", warn = FALSE)
intext = str_replace_all(intext,"\t","   ") ## Replaces each tab with three spaces for better reading

readLines is a base input function like read.csv. It makes 'intext' a 1D array (a character vector) with one element per line of the text file. Its length, from length(intext), is...

[1] 497

... biggish.

There are nearly 500 lines in the game summary, and we are only looking for 22 names, so a lot of this is going to be garbage.
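If you do not have Game_Summary.txt handy, here is a minimal sketch of the same loading step on a stand-in file. The temporary file and its two tab-separated lines are assumptions made for illustration; only the readLines and str_replace_all calls mirror the code above.

```r
library(stringr)

# tempfile() stands in for Game_Summary.txt (an assumption for this sketch)
tmp = tempfile(fileext = ".txt")
writeLines(c("Played at Kingsmead, Durban",
             "RG Sharma\tnot out\t50\t52\t40\t7\t2\t125.00"), tmp)

intext_demo = readLines(tmp, warn = FALSE)                # one element per line
intext_demo = str_replace_all(intext_demo, "\t", "   ")   # each tab becomes three spaces

print(length(intext_demo))   # 2
print(intext_demo[2])        # "RG Sharma   not out   50   52   40   7   2   125.00"
```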

Ideally, we want to inspect only the lines that have names, and as few of the other lines as possible.
Having looked at the text in Notepad++ http://notepad-plus-plus.org/ , we see that the batting summary starts after the first few lines.

 [1] "T20I no. 43 | 2007/08 season"                                                     
 [2] "Played at Kingsmead, Durban"                                                     
 [3] "20 September 2007 - night match (20-over match)"                                  
 [4] "    India innings (20 overs maximum)   R   M   B   4s   6s   SR"                 
 [5] "View dismissal   G Gambhir   c Smith b Pollock   19   19   19   3   0   100.00"   
 [6] "View dismissal   V Sehwag   c †Boucher b Ntini   11   23   11   1   0   100.00"   
 [7] "View dismissal   KD Karthik   c JA Morkel b Pollock   0   1   1   0   0   0.00"  
 [8] "View dismissal   RV Uthappa   c Smith b M Morkel   15   22   16   1   1   93.75"  
 [9] "RG Sharma   not out   50   52   40   7   2   125.00"                              
[10] "View dismissal   MS Dhoni*†   run out (Philander)   45   39   33   4   1   136.36"
[11] "IK Pathan   not out   0   1   1   0   0   0.00"     

The names start after the line "India innings (20 overs maximum)...", and inspection of several such game summaries shows a similar pattern. We can use this as a marker for when to start looking for names with the following code.

scanstate = 0
for(i in seq_along(intext))
{
   #### The rest of the code will go here!
   if(str_detect(intext[i], " innings [(][0-9]+"))
       scanstate = 1
}

Since we need to retain information from one line to the next, we cannot use the more scalable method of apply() for this problem (in any simple way, at least).
'scanstate' is just a variable name we're using where 0 indicates we're not looking for names, and 1 indicates that we are looking for names.

str_detect( X ,Y ) is where the action happens. It returns TRUE if the pattern Y appears in the string X, and FALSE otherwise.

The pattern Y is a regular expression. Here it means exactly " innings (", followed by at least one digit from 0 to 9.

The [ and ] are collection markers, which allow you to define 0-9 as 'any digit'. The + means 'one or more characters from the collection in [], in a row'.

Parentheses like ( and ) are special characters that regular expressions use for other things, so to match a literal left parenthesis we treat it as part of a collection, even if in this case '(' is the only character in that collection. (Escaping it as \\( also works.)

We want to be as specific as possible here to avoid catching things that are not markers of the start of the batting summary. That's why we don't simply search for ' innings'.
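To see the difference, here is a small sketch comparing the loose and the specific pattern. The first sample line is copied from the game summary; the other two are made up for illustration.

```r
library(stringr)

samples = c(
  "    India innings (20 overs maximum)   R   M   B   4s   6s   SR",  # from the summary
  "South Africa won the toss and chose to bowl in the first innings", # made up
  "Player of the match after a fine innings (see report)"             # made up
)

# The loose pattern matches every line that mentions an innings
loose = str_detect(samples, " innings")
# The specific pattern needs " innings (" followed by at least one digit
strict = str_detect(samples, " innings [(][0-9]+")

print(loose)    # TRUE TRUE TRUE
print(strict)   # TRUE FALSE FALSE
```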

Now that we have a means of determining what lines are from the batting summary, we can proceed to extract the names.

if(scanstate == 1)
{
    thistext = intext[i]

    playlist[playeridx,scanstate] = thistext
    playeridx = playeridx + 1

    if(playeridx > 11){scanstate = 0}
}

thistext, playlist (an 11x2 array: 11 rows, 2 columns), and playeridx (starting at 1) are variables that we have specified.
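The setup is not shown in this post, so here is one way it might look. Since the loop indexes playlist[playeridx,scanstate], this sketch assumes 11 rows and 2 columns; the empty-string fill is also an assumption.

```r
# Assumed setup, run once before the for loop
playlist  = matrix("", nrow = 11, ncol = 2)  # rows are batting positions
playeridx = 1    # next batting position to fill
scanstate = 0    # 0 = not looking for names, 1 = looking for names
thistext  = ""   # working copy of the current line
```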

This code will take the first 11 lines of the batting summary and save them as if they were the names of the first team. Unfortunately, this will grab the entire line instead of just the name, such as

"View dismissal   V Sehwag   c †Boucher b Ntini   11   23   11   1   0   100.00" 

where all we really want is "V Sehwag".

So we need a way to...
1) Remove 'View dismissal' from the start of each string (assuming no player is named View dismissal), which can be done with a replacement.

thistext = str_replace(thistext,"View dismissal", "")

2) Extract only the text that happens BEFORE the summary of the player. Inspection will show that all such summaries start with ' c ', ' lbw ', ' b ', ' not out ', or some other short tag that explains the fate of the Batsman/Batter. This code...

str_split_fixed(thistext," c | lbw | b | not out | run out | st ",2)[1]

...splits the single element 'thistext' (first parameter) into a 1D array of size 2 (third parameter), where anything fitting the pattern in the second parameter is used as the marker of the break between the elements.

In regular expressions, | means 'or', as in the pattern ' c ', or the pattern ' lbw ', or ... the pattern ' st '. We only want to keep the first element of this array, which should be the name, so we finish this line with [1] outside the function.
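Here is a quick sketch of that split on the Sehwag line, after 'View dismissal' has been removed. The variable names are just for illustration. Note that the split happens only at the first marker found, so the later ' b ' stays inside the second element.

```r
library(stringr)

line = "   V Sehwag   c †Boucher b Ntini   11   23   11   1   0   100.00"
parts = str_split_fixed(line, " c | lbw | b | not out | run out | st ", 2)

print(dim(parts))           # 1 2 (one input string, two pieces)
print(str_trim(parts[1]))   # "V Sehwag"
print(parts[2])             # everything after the first marker, ' b ' included
```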

With some additional similar housekeeping, we add this code chunk to the beginning of the for loop.

if(scanstate == 1)
{
 thistext = intext[i]
 thistext = str_replace(thistext,"View dismissal", "") ### Fixes problem 1. Only one replacement needed.
 thistext = str_split_fixed(thistext," c | lbw | b | not out | run out | st ",2)[1] ### Fixes problem 2
 thistext = str_trim(thistext) ### Removes whitespace on the ends
 thistext = str_replace_all(thistext,"[*]|†","") ### Removes all the special characters * and cross

 playlist[playeridx,scanstate] = thistext
 playeridx = playeridx + 1

 if(playeridx > 11){scanstate = 0}
}
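The stringr functions are vectorised, so the same cleaning steps can be tried on several lines at once. Below is a sketch that wraps them in a helper; extract_name() is a hypothetical name, not part of the original script, and the three lines are copied from the summary shown earlier.

```r
library(stringr)

# extract_name() is a hypothetical helper wrapping the four cleaning steps
extract_name = function(line) {
  name = str_replace(line, "View dismissal", "")
  # [, 1] keeps the first piece of every line, not just of the first line
  name = str_split_fixed(name, " c | lbw | b | not out | run out | st ", 2)[, 1]
  name = str_trim(name)
  str_replace_all(name, "[*]|†", "")
}

lines = c(
  "View dismissal   V Sehwag   c †Boucher b Ntini   11   23   11   1   0   100.00",
  "RG Sharma   not out   50   52   40   7   2   125.00",
  "View dismissal   MS Dhoni*†   run out (Philander)   45   39   33   4   1   136.36"
)

cleaned = extract_name(lines)
print(cleaned)   # "V Sehwag" "RG Sharma" "MS Dhoni"
```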

We are not quite done. Sometimes teams run out of overs before they run out of wickets, especially in Twenty20. This means we will miss players labelled under 'did not bat',

"Did not batHarbhajan Singh, Joginder Sharma, S Sreesanth, RP Singh"

and catch lines that are past the list of player names, like this:

"Fall of wickets 1-32 (Gambhir, 4.4 ov), 2-33 (Karthik, 4.6 ov), 3-33 (Sehwag, 5.1 ov), 4-61 (Uthappa, 10.3 ov), 5-146 (Dhoni, 19.4 ov)"

We can make another block of code to search for the list of non-batters.
Here, we only look at cases that start with 'Did not bat', then we remove that first part and split along ', '. We use str_split instead of str_split_fixed, because we do not know in advance how many people will be on the list.

Finally, we turn the list from str_split into an array like the one we get from str_split_fixed with the unlist() function, which is part of base R.

if(scanstate == 1 & str_detect(intext[i],"Did not bat"))
{
   thistext = intext[i]
   thistext = str_split_fixed(thistext,"Did not bat",2)[2]
   thistext = unlist(str_split(thistext,", "))

   playlist[playeridx:11,scanstate] = thistext

   scanstate = 0
   playeridx = 1
}
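As a standalone sketch of this step, here is the 'Did not bat' line from above taken apart piece by piece. It assumes seven batters have already been recorded, so playeridx is 8 and the four remaining names fit rows 8 to 11 exactly.

```r
library(stringr)

line = "Did not batHarbhajan Singh, Joginder Sharma, S Sreesanth, RP Singh"

rest      = str_split_fixed(line, "Did not bat", 2)[2]  # text after the label
as_list   = str_split(rest, ", ")                       # a list holding one character vector
names_vec = unlist(as_list)                             # a plain character vector

print(class(as_list))      # "list"
print(length(names_vec))   # 4

playlist  = matrix("", nrow = 11, ncol = 2)  # assumed setup, as before
playeridx = 8                                # assumes seven batters already stored
playlist[playeridx:11, 1] = names_vec
print(playlist[8:11, 1])
```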

Then, to fix the 'extra lines' problem, we add some conditions to the first if statement. Specifically, we ignore lines with the terms 'Extras', 'Total', and 'Did not bat'.

if(scanstate == 1
         & !str_detect(intext[i],"Extras")
         & !str_detect(intext[i],"Total   ")
         & !str_detect(intext[i],"Did not bat") )

Now when we run this code, we get the list of players in their (intended) batting order, and nothing more!

 [1,] "G Gambhir"      
 [2,] "V Sehwag"       
 [3,] "KD Karthik"     
 [4,] "RV Uthappa"     
 [5,] "RG Sharma"      
 [6,] "MS Dhoni"       
 [7,] "IK Pathan"      
 [8,] "Harbhajan Singh"
 [9,] "Joginder Sharma"
[10,] "S Sreesanth"    
[11,] "RP Singh"   
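For reference, here is a sketch that assembles the pieces above into one runnable loop. It is not the complete script: it handles a single innings, and instead of reading Game_Summary.txt it uses lines copied from this post. The 'Extras' and 'Total' rows are invented placeholders, and the matrix setup is an assumption.

```r
library(stringr)

# Stand-in for the file: lines copied from this post, plus two placeholder rows
intext = c(
  "Played at Kingsmead, Durban",
  "    India innings (20 overs maximum)   R   M   B   4s   6s   SR",
  "View dismissal   G Gambhir   c Smith b Pollock   19   19   19   3   0   100.00",
  "View dismissal   V Sehwag   c †Boucher b Ntini   11   23   11   1   0   100.00",
  "View dismissal   KD Karthik   c JA Morkel b Pollock   0   1   1   0   0   0.00",
  "View dismissal   RV Uthappa   c Smith b M Morkel   15   22   16   1   1   93.75",
  "RG Sharma   not out   50   52   40   7   2   125.00",
  "View dismissal   MS Dhoni*†   run out (Philander)   45   39   33   4   1   136.36",
  "IK Pathan   not out   0   1   1   0   0   0.00",
  "Extras   (placeholder row)",   # invented for this sketch
  "Total   (placeholder row)",    # invented for this sketch
  "Did not batHarbhajan Singh, Joginder Sharma, S Sreesanth, RP Singh",
  "Fall of wickets 1-32 (Gambhir, 4.4 ov), 2-33 (Karthik, 4.6 ov)"
)

playlist  = matrix("", nrow = 11, ncol = 2)  # assumed setup
playeridx = 1
scanstate = 0

for (i in seq_along(intext)) {
  # Batting lines: clean the line down to the player's name
  if (scanstate == 1
      & !str_detect(intext[i], "Extras")
      & !str_detect(intext[i], "Total   ")
      & !str_detect(intext[i], "Did not bat")) {
    thistext = str_replace(intext[i], "View dismissal", "")
    thistext = str_split_fixed(thistext, " c | lbw | b | not out | run out | st ", 2)[1]
    thistext = str_trim(thistext)
    thistext = str_replace_all(thistext, "[*]|†", "")

    playlist[playeridx, scanstate] = thistext
    playeridx = playeridx + 1
    if (playeridx > 11) { scanstate = 0 }
  }

  # 'Did not bat' line: fill the remaining batting positions and stop scanning
  if (scanstate == 1 & str_detect(intext[i], "Did not bat")) {
    thistext = str_split_fixed(intext[i], "Did not bat", 2)[2]
    thistext = unlist(str_split(thistext, ", "))
    playlist[playeridx:11, scanstate] = thistext
    scanstate = 0
    playeridx = 1
  }

  # Marker line: start looking for names from the next line onward
  if (str_detect(intext[i], " innings [(][0-9]+")) {
    scanstate = 1
  }
}

print(playlist[, 1])
```

The marker check sits at the end of the loop body on purpose, so the ' innings (20 overs maximum)' header itself is never treated as a name.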

Here is the complete code for this task, which I used to find the names of both teams.


  1. Useful explainer of the stringr package, but FYI this task is a lot easier with Hadley Wickham's rvest package and the incredibly useful Selectorgadget browser add-on. Four lines of code do the trick:

    library(rvest)
    pagehtml <- read_html("http://www.espncricinfo.com/ci/engine/match/287876.html") # read_html() replaces the older html()
    playerhtml <- html_nodes(pagehtml, ".to-bat .playerName , .batsman-name .playerName")
    playernames <- html_text(playerhtml)

    Here's an explanation of how to use Selectorgadget to get the CSS page selectors you want and then how to use those in rvest: http://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html

  2. Hi Sharon,

    Thanks for the tip about rvest! I can see now how that would work a lot better in a case like this where the data is already compiled into an HTML table with its own labels.

    My experience with cricinfo is mostly with the play-by-play commentary, which is less structured, so I hadn't even considered an HTML-based scraping approach. Thank you!