Tuesday, 14 July 2015

Lesson Prototype - Regular Expressions in R

My thesis has been handed in, so I've been restarting and catching up on other projects. Blog posting will continue at its normal pace of about 4 posts/month.

For the data management course I'm building, I want to include a module on text analysis. This is a prototype of a lesson and accompanying exercise on regular expressions in R. It runs through how strings are handled in R, including vectors of strings. It then covers some regular expression basics and demonstrates them with the stringr function str_detect().
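To give a flavour of what the lesson covers, here is a minimal sketch of str_detect() applied to a vector of strings. The fruit vector and the pattern are made up for illustration and are not taken from the lesson files; the stringr package is assumed to be installed.

```r
library(stringr)

# A character vector holds several strings at once (illustrative data)
fruit <- c("apple", "banana", "cherry")

# str_detect() is vectorised: it returns one TRUE/FALSE per element,
# telling us whether the pattern "an" appears anywhere in that string
has_an <- str_detect(fruit, "an")
print(has_an)
# [1] FALSE  TRUE FALSE
```

Because the result is a logical vector, it can be used directly for subsetting, as in fruit[has_an].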

There are some partially finished exercises at the end of the lesson. The exercises are outlined, and a list of anticipated errors is included for each question. However, the text files needed to actually do the exercises and check your work are not yet done.

Since all the exercises are programming based, I can write a solution checker that looks at the submitted answers and gives feedback based on anticipated mistakes. I talk a bit about such a program at the end of the document.
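As a rough idea of how such a checker might work, here is a sketch under some assumptions: the function name check_pattern(), the test strings, and the simplified postal code patterns are all hypothetical and are not the checker described in the lesson. It tests a submitted regular expression against strings it should match and strings it should reject, then reports what went wrong.

```r
library(stringr)

# Hypothetical checker: compares a submitted pattern against known cases
check_pattern <- function(pattern, should_match, should_reject) {
  # Strings the pattern ought to match but does not
  missed <- should_match[!str_detect(should_match, pattern)]
  # Strings the pattern ought to reject but matches anyway
  wrongly_hit <- should_reject[str_detect(should_reject, pattern)]

  feedback <- character(0)
  if (length(missed) > 0) {
    feedback <- c(feedback,
                  paste("Pattern fails to match:",
                        paste(missed, collapse = ", ")))
  }
  if (length(wrongly_hit) > 0) {
    feedback <- c(feedback,
                  paste("Pattern wrongly matches:",
                        paste(wrongly_hit, collapse = ", ")))
  }
  if (length(feedback) == 0) feedback <- "All test cases passed."
  feedback
}

# Illustrative test cases for a simplified Canadian postal code format
good <- c("V5A 1S6", "K1A 0B1")
bad  <- c("V5A 1S67", "5VA 1S6")

# Anticipated mistake: forgetting the ^ and $ anchors
unanchored <- "[A-Z][0-9][A-Z] [0-9][A-Z][0-9]"
anchored   <- "^[A-Z][0-9][A-Z] [0-9][A-Z][0-9]$"

print(check_pattern(unanchored, good, bad))
# [1] "Pattern wrongly matches: V5A 1S67"
print(check_pattern(anchored, good, bad))
# [1] "All test cases passed."
```

Each anticipated mistake could be paired with a test string designed to expose it, so the feedback points at the specific error rather than just saying the answer is wrong.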

As always, feedback is appreciated.
 
LibreOffice Version of Lesson

PDF Version of Lesson

Previous blog post on the stringr package


Lesson contents:
  • Strings in R
    • Strings can be stored and manipulated in a vector
    • Strings are not factors
    • Escape sequences
    • The str_match() function
  • Regular expressions in R
    • Period . means 'any'
    • Vertical line | means 'or'
    • +, *, and {} define repeats
    • ^ and $ mean 'beginning with' and 'ending with'
    • [] is a shortcut for 'or'
    • Hyphens in []
    • Example: building a regular expression for phone numbers
  • Exercises
    • Detect e-mail addresses
    • Detect a/an errors
    • Detect Canadian postal codes
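
The regular expression topics in the outline above can be previewed with a few short str_detect() calls. This is only a sketch: the test strings are invented, and the phone number pattern assumes a 555-123-4567 style format, which may differ from the one built in the lesson.

```r
library(stringr)

# Period . means 'any' single character
any_char <- str_detect(c("cat", "cut", "ct"), "c.t")
# TRUE TRUE FALSE

# Vertical line | means 'or'
either <- str_detect(c("grey", "gray", "green"), "grey|gray")
# TRUE TRUE FALSE

# {} defines repeats; ^ and $ anchor the start and end of the string
repeats <- str_detect(c("a", "aa", "aaa"), "^a{2,3}$")
# FALSE TRUE TRUE

# [] is a shortcut for 'or' over single characters
bracket <- str_detect(c("bat", "bet", "bit"), "b[ae]t")
# TRUE TRUE FALSE

# A hyphen inside [] gives a range, so [0-9] means any digit.
# Assumed phone format: three digits, three digits, four digits,
# separated by hyphens.
phone_pattern <- "^[0-9]{3}-[0-9]{3}-[0-9]{4}$"
phones <- c("555-123-4567", "5551234567", "555-123-456")
is_phone <- str_detect(phones, phone_pattern)
print(is_phone)
# [1]  TRUE FALSE FALSE
```

Without the ^ and $ anchors, the phone pattern would also accept longer strings that merely contain a phone number somewhere inside them.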