Statistics et al.: Natural Language Processing in R: Strings and Regular Expressions.

In this post, I go through a lesson in natural language processing (NLP), in R. Specifically, it covers how strings operate in R, how regular expressions work in the stringr package by Hadley Wickham, and some exercises. Included with the exercises is a list of expected hang-ups, as well as an R function that can quickly check the solutions.

This lesson is designed for a 1.5-2 hour class for senior undergrads.

Contents:

Strings in R

Strings can be stored and manipulated in a vector
Strings are not factors
Escape sequences
The str_detect() function

Regular expressions in R

Period . means 'any'
Vertical line | means 'or'
+, *, and {} define repeats
^ and $ mean 'beginning with' and 'ending with'
[] is a shortcut for 'or'
hyphens in []
example: building a regular expression for phone numbers

Exercises

Detect e-mail addresses
Detect a/an errors
Detect Canadian postal codes

LibreOffice Version of Lesson

PDF Version of Lesson

Previous blog post on the stringr package

String Variables in R

String manipulation is largely the same across all popular programming languages and platforms, so most of the principles can be learned through base R, the stringr package, and regular expressions, which are used in both.

In R, a string such as "apples taste good" is treated as an atomic object. That is, you can't reference just part of it using the index brackets []. (Later, we'll cover substrings, which allow you to look at only part of a single string using functions in stringr.) If you are familiar with more low-level languages like C, this may be counterintuitive because the char variable type refers to a single letter or number.

You can declare a string variable in R like so:

x = "apples"
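
To make the 'atomic object' point concrete, here is a minimal sketch. The variable name s is just an illustration, and str_sub() is one of the stringr substring functions; this is a preview rather than the full substring lesson.

```r
library(stringr)

s = "apples taste good"

# Indexing with [] treats the whole string as one element of a length-1 vector
first_element = s[1]
n_elements = length(s)     # counts strings, not characters
n_characters = nchar(s)    # counts characters within the string

# To look at part of a string, use a substring function instead of []
first_six = str_sub(s, 1, 6)

print(first_element)
print(n_elements)
print(n_characters)
print(first_six)
```

So s[1] gives back the entire string, while str_sub() is what reaches inside it.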

You can also declare and work with a vector (or matrix, or array) of strings the same way as one would declare any other vector of variables. As an example, x below is a vector of strings:

x = c("apples", "oranges", "pears")

, so is y, which includes a missing value and a number

y = c("a", "b", NA, 32)
y
#output: [1] "a" "b" NA "32"

, and so is z, which starts off as all numbers. If any value in a vector is a string, the whole vector is treated as a string. This is a common issue when inputting data from a file.

z = c(12,45,"cows go moo")
z
#output: [1] "12" "45" "cows go moo"
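
When this happens with data from a file, one way to track down the culprit is to convert back to numbers and see which entries fail. This is only a sketch of that idea; the names z_numeric and bad_positions are made up for illustration.

```r
z = c(12, 45, "cows go moo")

# The whole vector has been coerced to strings
z_class = class(z)

# Converting back turns anything that isn't a number into NA
# (R also gives a warning, which is silenced here)
z_numeric = suppressWarnings(as.numeric(z))

# The positions of the entries that forced the vector to be a string
bad_positions = which(is.na(z_numeric))

print(z_class)
print(z_numeric)
print(bad_positions)
```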

Mathematical functions won't work with strings:

x + 3
#output: Error in x + 3 : non-numeric argument to binary operator

But vector functions like concatenation work fine:

c(x, "banana")
#output: [1] "apples" "oranges" "pears" "banana"

Try these to explore how vector functions work with vectors of strings.
sample(x)
letters
LETTERS
sample(letters[1:10], 3)
rev(letters)
rev(x)
table(x)
rep(x,times=2)
rep(x,each=2)
rep(x, times=2, each=2)

Also, strings are NOT factors. In factors, there is some underlying structure that links certain values to certain names. In a string vector, each string is stored exactly as the value in the quotation marks, without any special coding.

Try these to explore the relationship between strings and factors:
is.factor(x)
as.factor(x)
as.numeric(as.factor(x))
as.character(as.factor(x))
as.numeric(x)
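
Here is roughly what those commands show, using a small vector with a repeated value (an assumption made here so that the underlying factor coding is easier to see).

```r
fruit = c("pears", "apples", "pears")

f = as.factor(fruit)

# A factor stores integer codes that point to a set of levels
f_levels = levels(f)          # the distinct values, sorted
codes = as.numeric(f)         # the code behind each value
back = as.character(f)        # the original strings again

# A string vector has no such codes, so a direct conversion gives only NA
direct = suppressWarnings(as.numeric(fruit))

print(f_levels)
print(codes)
print(back)
print(direct)
```

The codes come from the factor's levels, not from the text itself, which is why as.numeric() behaves so differently on the two types.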

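In versions of R before 4.0.0, the data input functions read.table() and read.csv() interpret strings as factors by default. This is overridden by including the option as.is=TRUE in each function. (From R 4.0.0 onward, strings are read in as strings by default.)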
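
A small sketch of the difference, reading from a string through the text argument of read.csv() rather than from a real file. The data here are made up.

```r
# Made-up CSV content, passed in through the 'text' argument
csv_text = "fruit,count\napples,3\npears,5"

# as.is=TRUE keeps the text column as strings
as_strings = read.csv(text = csv_text, as.is = TRUE)

# stringsAsFactors=TRUE asks for factors explicitly
as_factors = read.csv(text = csv_text, stringsAsFactors = TRUE)

print(class(as_strings$fruit))
print(class(as_factors$fruit))
```

Setting the option explicitly means the result does not depend on which version of R is running.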

The quotation marks in a string mark the beginning and end of that string. That leaves a problem when a string has quotation marks in it. For cases like that, there are patterns called escape characters, for example:

\" in a string means 'literally a quotation mark' rather than 'end the string here'

There are other useful escape characters too:

\t means 'tab'
\n means 'new line', and
\\ means 'literally a backslash' rather than 'starting an escape character'.
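
A quick sketch of these in action. The example strings are made up; print() shows a string with its escapes still visible, while cat() shows the characters they stand for.

```r
quote_text = "She said \"hello\""
tabbed = "a\tb"
two_lines = "line one\nline two"
backslash = "C:\\temp"

# print() displays the escaped form, cat() displays the actual characters
print(quote_text)
cat(quote_text, "\n")
cat(two_lines, "\n")
cat(backslash, "\n")

# Each escape sequence counts as a single character
n_quote = nchar("\"")
n_tabbed = nchar(tabbed)
n_backslash = nchar(backslash)
```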

The str_detect() function

In the package stringr, which you can install with install.packages("stringr") and load with library(stringr), there are many handy functions for string manipulation, which will be covered over the next few lessons. For now, we'll focus on the str_detect() function.

str_detect(x, pattern) detects if the pattern 'pattern' appears in string 'x'.

For example

str_detect("apples", "app")
#output: TRUE

str_detect("orange","app")
#output: FALSE

Like all other functions in stringr, str_detect() is case sensitive, so

str_detect("apples","APP")
#output: [1] FALSE

str_detect("APPLES","app")
#output: [1] FALSE

A vector of strings can be checked all at once, and a vector of answers is returned

str_detect(c("apples","orange","potato"),"app")
#output: [1] TRUE FALSE FALSE
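
Because the answer is a logical vector, it can be used like any other one: for subsetting, counting, or finding positions. The sketch below also shows two ways around case sensitivity, str_to_lower() and the regex() helper with ignore_case=TRUE, both from stringr. The vector fruit is made up.

```r
library(stringr)

fruit = c("apples", "orange", "APPLES", "pineapple")

has_app = str_detect(fruit, "app")

matching_words = fruit[has_app]    # keep only the matches
matching_places = which(has_app)   # positions of the matches
n_matching = sum(has_app)          # how many matched

# Two ways to ignore case
lowered = str_detect(str_to_lower(fruit), "app")
ignored = str_detect(fruit, regex("app", ignore_case = TRUE))

print(matching_words)
print(matching_places)
print(n_matching)
print(lowered)
print(ignored)
```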

Regular Expressions

You can tell R (or many other programming environments) to search out more general patterns than literal matches by using regular expressions, or RegEx for short.

The period . means 'any character'. Any letter, number, space, or punctuation character will count as a match. However, there has to be something in the place of the period, and it only refers to a single character.

str_detect(c("dog","dig","dAg","dg","dawg"),"d.g")
#[1] TRUE TRUE TRUE FALSE FALSE

However, multiple periods can be used

str_detect(c("dog","dawg","drag","d og"), "d..g")
#[1] FALSE TRUE TRUE TRUE

If a literal period is required, use the escape sequence \\.

str_detect(c("dog","d g","d.g","d.ggy paws"), "d\\.g")
#[1] FALSE FALSE TRUE TRUE

The vertical line | is the "or" operator, and is used similarly to how it is in R logic.
The regular expression "cat|dog" means literally "cat" or "dog".

str_detect("caterpillar", "cat|dog")
#[1] TRUE
str_detect("dogma","cat|dog")
#[1] TRUE

The plus sign + means "at least one repeat of the thing before it". By default the thing before it is a single character, but with round parentheses, multiple characters or a more complex subpattern can be included.

str_detect("doooogma","do+g")
#[1] TRUE
str_detect("dgma","do+g")
#[1] FALSE
str_detect("dogogogma", "d(og)+ma")
#[1] TRUE

The asterisk * works just like the + sign, but is "at least ZERO repeats of the thing" instead. This means a match will be found even when the character or subpattern before the star is not there. The 'at least zero' operator is especially useful for spaces when a term is sometimes written as a compound word.

str_detect("doooogma","do*g")
#[1] TRUE
str_detect("doofoogma","do*g")
#[1] FALSE
str_detect("dgma","do*g")
#[1] TRUE
str_detect("hyperprior","hyper *prior")
#[1] TRUE

To be more specific about the number of allowable repeats, use curly braces.
{2} means "exactly two repeats"
{0,4} means "from zero to four repeats"
{2,5} means "between two and five repeats, inclusive"
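
The curly braces can be tried out in the same way as + and *. This is a small sketch; the 'dog' words are made up, with the last one built to have seven o's.

```r
library(stringr)

# "dg", "dog", "doog", "dooog", and a version with seven o's
words = c("dg", "dog", "doog", "dooog", paste0("d", strrep("o", 7), "g"))

exactly_two = str_detect(words, "do{2}g")
zero_to_four = str_detect(words, "do{0,4}g")
two_to_five = str_detect(words, "do{2,5}g")

print(exactly_two)
#[1] FALSE FALSE TRUE FALSE FALSE
print(zero_to_four)
#[1] TRUE TRUE TRUE TRUE FALSE
print(two_to_five)
#[1] FALSE FALSE TRUE TRUE FALSE
```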

When the regular expression starts with a caret ^ , the string being checked must start with that pattern to count as a match.

str_detect("caterpillar","^cat")
#[1] TRUE
str_detect("caterpillar","^ter")
#[1] FALSE
str_detect("caterpillar","^pillar")
#[1] FALSE
str_detect("caterpillar","ter")
#[1] TRUE
str_detect("caterpillar","pillar")
#[1] TRUE

Likewise, if a regular expression ends with a dollar sign $ , the string must end with the pattern to count as a match. If both ^ and $ are present, then the string cannot contain anything except the pattern. Note that | splits the whole expression, so "^cat|pillar$" means 'starts with cat' or 'ends with pillar'.

str_detect("caterpillar","cat$")
#[1] FALSE
str_detect("caterpillar","pillar$")
#[1] TRUE
str_detect("caterpillar","^cat$")
#[1] FALSE
str_detect("caterpillar","^cat|pillar$")
#[1] TRUE
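
One thing worth trying here is how ^ and $ interact with the 'or' operator. In the sketch below (the word list is made up), the pattern without parentheses reads as 'starts with cat' or 'ends with pillar', while the version with parentheses only accepts a string that is exactly one of the two words.

```r
library(stringr)

words = c("cat", "pillar", "caterpillar", "catalog", "concat")

# 'starts with cat' OR 'ends with pillar'
either_end = str_detect(words, "^cat|pillar$")

# the whole string must be 'cat' or 'pillar'
whole_word = str_detect(words, "^(cat|pillar)$")

print(either_end)
#[1] TRUE TRUE TRUE TRUE FALSE
print(whole_word)
#[1] TRUE TRUE FALSE FALSE FALSE
```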

Square brackets represent ranges of characters. Anything within the square brackets is usable for a match. Ranges are a shorthand way of writing a long list of 'or' statements. For example (i|y) and [iy] mean the same thing, "i or y".

str_detect(c("tire","tier","tyre"), "t(i|y)re")
# [1] TRUE FALSE TRUE

str_detect(c("tire","tier","tyre"), "t[iy]re")
#[1] TRUE FALSE TRUE

Ranges are defined in one of two ways.
1. As a literal list, such as [abcde], meaning "a, b, c, d, or e".
2. As a hyphenated range, such as [a-z], meaning "any lowercase letter".

These methods can be combined, so [ABCa-z] means "any lowercase letter, or A, B, or C".
Also, more than one range can be used. A common one is [a-zA-Z0-9], meaning "any number or letter".
If you wish to include a hyphen in a range, it has to be at the beginning or end of the [ ] range. For example, [abd-] means "a, b, d, or hyphen", whereas [ab-d] means "a, b, c, or d".
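
The two hyphen examples above can be checked directly. This is a small sketch on a made-up vector of single characters.

```r
library(stringr)

chars = c("a", "c", "-", "z")

# Hyphen at the end of the brackets: a literal hyphen
literal_hyphen = str_detect(chars, "[abd-]")

# Hyphen between b and d: the range b to d
hyphen_as_range = str_detect(chars, "[ab-d]")

print(literal_hyphen)
#[1] TRUE FALSE TRUE FALSE
print(hyphen_as_range)
#[1] TRUE TRUE FALSE FALSE
```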

Ranges inside square brackets can be combined with the *, +, or {} operators to create very useful regular expressions.

"[a-z]+ [a-z]+" means two lower case words (note the space).

"[a-zA-Z-]+" means any word, possibly hyphenated.

"(19[0-9]{2})|(20[0-9]{2})" will catch any year from 1900 to 2099

"[0-9]{3}-[0-9]{4}" will catch a seven-digit phone number.

"[0-9]{3}[ -]*[0-9]{4}" will catch a seven-digit phone number, even if spaces are included or the hyphen missing.

"(\\(*[0-9]{3}\\)*)* *[0-9]{3}[ -]*[0-9]{4}" will catch a phone number whether or not the area code is included. Note the escape sequences for parentheses, \\( and \\).

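
Here is a sketch that runs the three phone number patterns side by side. The phone numbers are made up, and the last pattern is written on the assumption that the parentheses are escaped as \\( and \\). str_extract() is another stringr function; it returns the matched text rather than TRUE or FALSE.

```r
library(stringr)

phones = c("555-1234", "555 1234", "5551234", "(604) 555-1234", "555-12")

strict = "[0-9]{3}-[0-9]{4}"
flexible = "[0-9]{3}[ -]*[0-9]{4}"
with_area = "(\\(*[0-9]{3}\\)*)* *[0-9]{3}[ -]*[0-9]{4}"

strict_hits = str_detect(phones, strict)
flexible_hits = str_detect(phones, flexible)
area_hits = str_detect(phones, with_area)

# str_extract() shows how much of the string each pattern picks up
local_part = str_extract("(604) 555-1234", flexible)
full_number = str_extract("(604) 555-1234", with_area)

print(strict_hits)
#[1] TRUE FALSE FALSE TRUE FALSE
print(flexible_hits)
#[1] TRUE TRUE TRUE TRUE FALSE
print(area_hits)
#[1] TRUE TRUE TRUE TRUE FALSE
print(local_part)
#[1] "555-1234"
print(full_number)
#[1] "(604) 555-1234"
```

Note that str_detect() alone can't tell the last two patterns apart on these strings, since both find a seven-digit number inside the longer one. The difference only shows up in what gets extracted.
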
Regex Exercise 1:

From the text within "ex_regex_1.txt", use readLines() to bring the data into R, and then use the str_detect() function in the stringr package to find all the lines in the text that include an e-mail address.

You can check your work by loading the packet "text_analysis_solutions.zip" and entering the command "check_solution(solution, "regex01")"

Hints:
Make sure to load the package before you start by entering "library(stringr)" into R. Once the package is loaded, you can find examples of str_detect() in R by entering "?str_detect" to bring up a help screen. The examples are at the bottom of the help page.

E-mail addresses are of the form "name@domain.tld". The last part is called the 'top level domain'.

You can assume that all top level domains are formed like '.ca', '.com', '.mobi', or '.co.uk'. No top level domain will use international symbols; for example, any Indian domains will use '.in' rather than the Hindi equivalent.

Challenge: Use str_extract_all() to extract the e-mails from the data.
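
If the workflow is unfamiliar, here is a sketch of its general shape using a made-up stand-in file and a toy pattern (runs of digits), so it doesn't give away the e-mail pattern. The file contents and variable names are assumptions for illustration. The form that check_solution() expects for the solution isn't stated here, so both the logical vector and the line numbers are kept.

```r
library(stringr)

# A made-up stand-in for "ex_regex_1.txt"
toy_file = tempfile(fileext = ".txt")
writeLines(c("no digits here",
             "room 101 and room 202",
             "still nothing",
             "call 911"), toy_file)

# readLines() gives one string per line of the file
lines = readLines(toy_file)

# Toy pattern: one or more digits (deliberately not an e-mail pattern)
toy_pattern = "[0-9]+"

hits = str_detect(lines, toy_pattern)   # TRUE/FALSE for each line
matching_lines = which(hits)            # line numbers of the matches

# str_extract_all() returns a list with one element per line
extracted = str_extract_all(lines, toy_pattern)
all_matches = unlist(extracted)

print(matching_lines)
#[1] 2 4
print(all_matches)
#[1] "101" "202" "911"
```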

------------

For RegEx Exercise 1, the following errors are anticipated by check_solution()

Anticipated errors -- Feedback

Not including numbers. -- Remember: Names and domains can include numbers, such as monty123@python.co.uk, or fake@1000words.com.

Missing the @ symbol. -- Consider the 'at' symbol as well.

Missing the top level domain. -- Consider the top level domain as well, such as '.com', or '.edu'.

Allowing for multiple @ symbols. -- There should be only one 'at' symbol, between the name and the domain.

Regex Exercise 2:

From the text within "ex_regex_2.txt", use readLines() and str_detect() to find all the lines in the text that include an incorrect usage of 'a' or 'an'.

Hints:

"an" comes before a vowel, such as 'an anvil', or 'an operation'.

"a" comes before a consonant, which is any non-vowel, such as 'a butterfly', or 'a dragon'.

The a/an rule uses the way something is pronounced. Initialisms such as 'PHP' and 'IBM' are always in uppercase and are pronounced 'pee-aych-pee' and 'aye-bee-emm', respectively.

You can check your work with the command "check_solution(solution, "regex02")"

---------------

For RegEx Exercise 2, the following errors are anticipated by check_solution()

Anticipated errors -- Feedback

Only one of the first two a/an conditions is being checked. -- To check more than one condition at once, use the 'or' operator | between cases.

Uppercase isn't being checked. -- Uppercase needs to be checked too. For example, use the range [aeiouAEIOU] instead of [aeiou].

Initialisms aren't being accounted for. -- Initialisms are a special case. For example MTV is pronounced 'em-tee-vee', meaning the M starts with a vowel sound. The letters that start with a vowel sound are AEFIHLMNORSUX.

Catching the ends of words. -- The ends of words instead of just 'a' or 'an' are being caught, such as 'pizza elbow'. Consider how you can use spaces to avoid that.

Regex Exercise 3:

From the text within "ex_regex_3.txt", use readLines() and str_detect() to find all the lines in the text that include a Canadian postal code. Canadian codes are of the form 'v1a2b3', or 'v1a 2b3', of either lower or upper case.

You can check your work with the command "check_solution(solution, "regex03")"

Challenge: Use str_extract_all() to extract the postal codes from the data, and use str_replace_all() to remove any in-between spaces and padding spaces.
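
For the clean-up part of the challenge, here is a sketch of what str_replace_all() does with spaces, alongside str_trim() from stringr for comparison. The strings are made up and are not postal codes.

```r
library(stringr)

messy = c(" ab 12 ", "cd34", "  ef  56")

# Replace every space with nothing, wherever it appears
no_spaces = str_replace_all(messy, " ", "")

# str_trim() only removes the padding at the two ends
trimmed = str_trim(messy)

print(no_spaces)
#[1] "ab12" "cd34" "ef56"
print(trimmed)
#[1] "ab 12" "cd34" "ef  56"
```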

------------

For RegEx Exercise 3, the following errors are anticipated by check_solution().

Anticipated errors -- Feedback

Not checking for postal codes at all. -- Postal codes have to be checked as well.

Missing the space in the middle of postal code. -- Sometimes postal codes have a space in the middle.

Allowing for spaces anywhere in the postal code. -- Postal codes only have a space between the 3rd and 4th characters.

Allowing for any arrangement of letters and numbers in a postal code. -- Postal codes are always of the form "letter-number-letter number-letter-number".

Missing capitals. -- Almost there! But... sometimes postal codes are written in upper case.

Missing lowercase. -- Technically right, but people are lazy and sometimes write postal codes in lower case too.

About check_solution()

The command "check_solution(solution, exercise_code)" has its R code written in text_analysis_solutions.zip. At its most basic, it determines whether the object named "solution" is a correct answer to the exercise matching the string named "exercise_code".

However, in some cases, it can recognize certain incorrect or partially correct solutions and offer specific feedback based on the problem that most likely caused them. For example, if a correct solution to RegEx Exercise 1 would find that only lines 3, 5, and 8 contain the text pattern of interest, and the learner's input is "lines 5, 8, and 11 contain the pattern", check_solution() will return "Not quite, have a look at lines 3 and 11". Furthermore, if that particular learner input comes from an anticipated type of error, check_solution() will also reply with feedback such as "make sure you've considered uppercase letters as well". By default, only one such line of feedback is given at a time, based on the most basic error made, but the learner has the option of receiving all the feedback if they wish.
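
To illustrate the 'have a look at lines 3 and 11' idea, here is a hypothetical sketch of the comparison. It is not the code in text_analysis_solutions.zip; the function compare_lines() and its arguments are made up for this example, and it assumes answers are given as line numbers.

```r
# Hypothetical helper, NOT the real check_solution():
# compares the learner's line numbers to the correct ones
compare_lines = function(learner_lines, correct_lines) {
  missed = setdiff(correct_lines, learner_lines)   # correct lines not found
  extra = setdiff(learner_lines, correct_lines)    # lines flagged by mistake
  to_review = sort(c(missed, extra))

  if (length(to_review) == 0) {
    return("Correct!")
  }
  paste0("Not quite, have a look at lines ",
         paste(to_review, collapse = " and "))
}

feedback = compare_lines(c(5, 8, 11), c(3, 5, 8))
print(feedback)
#[1] "Not quite, have a look at lines 3 and 11"
```

Line 3 was missed and line 11 was flagged by mistake, so those are the two lines worth a second look.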

I hope this lesson is easy to digest and makes you feel bright.

Tuesday, 9 April 2019
