Tuesday, 14 July 2015

Lesson Prototype - Regular Expressions in R

This is a prototype of a lesson and accompanying exercise on regular expressions in R. It runs through how strings are handled in R including vectors of strings. Then it covers some of the regular expression basics and demonstrates them with the stringr function str_detect().
There are some partially finished exercises at the end of the lesson. The exercises are outlined, and a list of anticipated errors is included for each question.




As always, feedback is appreciated.
 
LibreOffice Version of Lesson

PDF Version of Lesson

Previous blog post on the stringr package


Lesson contents:
  • Strings in R
    • Strings can be stored and manipulated in a vector
    • Strings are not factors
    • Escape sequences
    • The str_match() function
  • Regular expressions in R
    • Period . means 'any'
    • Vertical line | means 'or'
    • +, *, and {} define repeats
    • ^ and $ mean 'beginning with' and 'ending with'
    • [] is a shortcut for 'or'
    • hyphens in []
    • example: building a regular expression for phone numbers 
  • Exercises
    • Detect e-mail addresses
    • Detect a/an errors
    • Detect Canadian postal codes

Likewise, if a regular expression ends with a dollar sign $ , the string must end with the pattern to count as a match. If both ^ and $ are present, then the string cannot contain anything except the pattern.

str_detect("caterpillar","cat$")
[1] FALSE
str_detect("caterpillar","pillar$")
[1] TRUE
str_detect("caterpillar","^cat$")
[1] FALSE
str_detect("caterpillar","^cat|pillar$")
[1] TRUE

Square brackets represent ranges of characters. Any single character listed within the square brackets can be used for a match. Ranges are a shorthand way of writing a long list of 'or' statements. For example, (i|y) and [iy] mean the same thing, "i or y".


str_detect(c("tire","tier","tyre"), "t(i|y)re")
[1]  TRUE FALSE  TRUE
str_detect(c("tire","tier","tyre"), "t[iy]re")
[1]  TRUE FALSE  TRUE

Ranges are defined in one of two ways.
1. As a literal list, such as [abcde], meaning "a, b, c, d, or e".
2. As a hyphenated range, such as [a-z], meaning "any lowercase letter".

These methods can be combined, so [ABCa-z] means "any lowercase letter, or A, B, or C".
Also, more than one range can be used. A common one is [a-zA-Z0-9], meaning "any digit or letter".
If you wish to include a literal hyphen in a range, put it at the beginning or end of the [ ] range. For example, [abd-] means "a, b, d, or hyphen", whereas [ab-d] means "a, b, c, or d".
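
The difference between a literal hyphen and a range hyphen can be checked directly. The snippet below is a small sketch; the test vector and the variable names are made up for illustration.

```r
library(stringr)

x <- c("a", "c", "-")

# Hyphen at the end of the brackets is a literal: a, b, d, or "-"
literal_hyphen <- str_detect(x, "[abd-]")
print(literal_hyphen)   # TRUE FALSE TRUE

# Hyphen between two characters defines a range: a, or b through d
range_hyphen <- str_detect(x, "[ab-d]")
print(range_hyphen)     # TRUE TRUE FALSE
```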

Ranges inside square brackets can be combined with the *, +, or {} operators to create very useful regular expressions.

"[a-z]+ [a-z]+" means two lowercase words separated by a single space (note the space in the pattern).

"[a-zA-Z-]+" means any word, possibly hyphenated.

"(19[0-9]{2})|(20[0-9]{2})" will catch any year from 1900 to 2099.

"[0-9]{3}-[0-9]{4}" will catch a seven-digit phone number.

"[0-9]{3}[ -]*[0-9]{4}" will catch a seven-digit phone number, even if spaces are included or the hyphen is missing.

"(\\(*[0-9]{3}\\)*)* *[0-9]{3}[ -]*[0-9]{4}" will catch a phone number whether or not the area code is included. Note the escape sequences for parentheses, \\( and \\). The backslash is doubled because a backslash inside an R string must itself be escaped.
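
As a quick check, the two phone number patterns can be tried on a handful of strings. This is only a sketch: the sample numbers are invented, and str_extract() (also from stringr) is used here to show exactly what the longer pattern matches.

```r
library(stringr)

seven_digit <- "[0-9]{3}[ -]*[0-9]{4}"
with_area   <- "(\\(*[0-9]{3}\\)*)* *[0-9]{3}[ -]*[0-9]{4}"

# Invented sample numbers; the last one is one digit short
numbers <- c("555-1234", "555 1234", "5551234", "55-1234")
detected <- str_detect(numbers, seven_digit)
print(detected)    # TRUE TRUE TRUE FALSE

# str_extract() returns the matched text, so the area code is visible when present
extracted <- str_extract(c("(403) 555-1234", "555-1234"), with_area)
print(extracted)   # "(403) 555-1234" "555-1234"
```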


Regex Exercise 1:

From the text within "ex_regex_1.txt", use readLines() to bring the data into R, and then use the str_detect() function from the stringr package to find all the lines in the text that include an e-mail address.

You can check your work by loading the package "text_analysis_solutions.zip" and entering the command "check_solution(solution, "regex01")".

Hints:
Make sure to load the package before you start by entering "library(stringr)" into R. Once the package is loaded, you can find examples of str_detect() in R by entering "?str_detect" to bring up a help screen. The examples are at the bottom of the help page.
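
The general readLines() and str_detect() workflow might look like the sketch below. It uses a temporary file with three invented lines in place of the exercise file, and a four-digit pattern rather than an e-mail pattern, so it does not give the answer away. The variable names are placeholders, and the form of the object that check_solution() expects is not specified here.

```r
library(stringr)

# Stand-in for the exercise file: write three invented lines to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Released in 1999.", "No year here.", "Updated in 2015."), tmp)

lines_in <- readLines(tmp)                     # one element per line of the file
has_year <- str_detect(lines_in, "[0-9]{4}")   # TRUE where a line has four digits in a row
matching_lines <- lines_in[has_year]           # keep only the lines that matched
print(matching_lines)
```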

E-mail addresses are of the form "name@domain.tld". The last part is called the 'top level domain'.

You can assume that all top level domains are formed like '.ca', '.com', '.mobi', or '.co.uk'. No top level domain will use international symbols. For example, any Indian domains will use '.in' rather than the Hindi equivalent.

Challenge: Use str_extract_all() to extract the e-mails from the data.
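
For anyone new to str_extract_all(), here is a small sketch of how it behaves, using the year pattern from the lesson instead of an e-mail pattern. The sample sentences are invented. It returns a list with one element per input string, so unlist() is used to flatten the results.

```r
library(stringr)

# Invented sample text
x <- c("Founded 1987, moved 2004.", "No dates.", "Closed 2015.")

found <- str_extract_all(x, "(19[0-9]{2})|(20[0-9]{2})")
matches_per_line <- lengths(found)   # number of matches found in each string
print(matches_per_line)              # 2 0 1

all_years <- unlist(found)           # flatten the list into one character vector
print(all_years)                     # "1987" "2004" "2015"
```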

------------

For RegEx Exercise 1, the following errors are anticipated by check_solution()

Anticipated errors -- Feedback

Not including numbers. -- Remember: Names and domains can include numbers, such as monty123@python.co.uk, or fake@1000words.com.

Missing the @ symbol. -- Consider the 'at' symbol as well.

Missing the top level domain. -- Consider the top level domain as well, such as '.com', or '.edu'.

Allowing for multiple @ symbols. -- There should be only one 'at' symbol, between the name and the domain.




Regex Exercise 2:

From the text within "ex_regex_2.txt", use readLines() and str_detect() to find all the lines in the text that include an incorrect usage of 'a' or 'an'.

Hints:
"an" comes before a vowel, such as 'an anvil', or 'an operation'.
"a" comes before a consonant, which is any non-vowel, such as 'a butterfly', or 'a dragon'.
The a/an rule uses the way something is pronounced. Initialisms such as 'PHP' and 'IBM' are always in uppercase and are pronounced 'pee-aych-pee' and 'aye-bee-emm', respectively.

You can check your work with the command "check_solution(solution, "regex02")"

---------------

For RegEx Exercise 2, the following errors are anticipated by check_solution()

Anticipated errors -- Feedback

Only one of the first two a/an conditions is being checked. -- To check more than one condition at once, use the 'or' operator | between cases.

Uppercase isn't being checked. -- Uppercase needs to be checked too. For example, use the range [aeiouAEIOU] instead of [aeiou].

Initialisms aren't being accounted for. -- Initialisms are a special case. For example, MTV is pronounced 'em-tee-vee', meaning the M starts with a vowel sound. The letters whose names start with a vowel sound are AEFHILMNORSX.

Catching the ends of words. -- The ends of words are being caught instead of just the standalone words 'a' or 'an', such as the 'a' at the end of 'pizza' in 'pizza elbow'. Consider how you can use spaces to avoid that.



Regex Exercise 3:

From the text within "ex_regex_3.txt", use readLines() and str_detect() to find all the lines in the text that include a Canadian postal code. Canadian codes are of the form 'v1a2b3', or 'v1a 2b3', in either lower or upper case.

You can check your work with the command "check_solution(solution, "regex03")"

Challenge: Use str_extract_all() to extract the postal codes from the data, and use str_replace_all() to remove any in-between spaces and padding spaces.
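
The space-removal step of the challenge relies on str_replace_all(), which replaces every match of a pattern in each string. A minimal sketch, using invented phone-style strings rather than postal codes:

```r
library(stringr)

# Invented strings with padding spaces and an in-between space
padded <- c(" 555 1234 ", "5551234")

# Replace every space with nothing, wherever it appears in the string
cleaned <- str_replace_all(padded, " ", "")
print(cleaned)   # "5551234" "5551234"
```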



------------

For RegEx Exercise 3, the following errors are anticipated by check_solution().

Anticipated errors -- Feedback

Not checking for postal codes at all. -- Postal codes have to be checked as well.

Missing the space in the middle of postal code. -- Sometimes postal codes have a space in the middle.

Allowing for spaces anywhere in the postal code. -- Postal codes only have a space between the 3rd and 4th characters.

Allowing for any arrangement of letters and numbers in a postal code. -- Postal codes are always of the form "letter-number-letter number-letter-number".

Missing capitals. -- Almost there! But... sometimes postal codes are written in upper case.

Missing lowercase. -- Technically right, but people are lazy and sometimes write postal codes in lower case too.

I hope this lesson is easy to digest and makes you feel bright.