Statistics et al.: Natural Language Processing in R: Edit Distance

These are the notes for the second lecture in the unit on text processing. Some useful ideas like exact string matching and the definitions of characters and strings are covered in the notes of Natural Language Processing in R: Strings and Regular Expressions

Edit distance, also called Levenshtein distance, is a measure of the number of primary edits that would need to be made to transform one string into another. The R function adist() is used to find the edit distance.

adist("exactly the same","exactly the same") # edit distance 0 
adist("exactly the same","totally different") # edit distance 14

A primary edit is one of three things:
a) An insertion of a character.
b) A deletion of a character.
c) A substitution of a character.

For example:

## Insertions: All of these will return an edit distance of 1.
adist("5678","05678") # inserting an '0' character before the '1'
adist("weasel", "w easel")  # inserting a space between w and e
adist("weasel", "weasel!")  # inserting a '!' character after the 'l'

## Inserting 5 characters gives an edit distance of 5
adist("Genome","Lord Genome")

## Deletions
adist("Genome","Gnome") # deletion of the first 'e': 1 edit
adist("Genome","nome")  # deleting the first two characters: 2 edits
adist("Canada","Cnd")   # deleting all three ehs: 3 edits

## substitutions
adist("iiiii","iixii")  # replacing 1 'i'
# 1
adist("iiiii","xiyiz")  # replacing 3
# 3

If more than one type of edit is needed to turn one string into another, the edit distance across different types is additive.

adist("weasel","Vin Deasel")  # 4 insertions, then replacing 'w' with 'D'
# 5

The edit distance is the 'shorted path' distance between two strings. This means that two strings might be closer than you expect. For example, you can get from "Moos" to "oose" with three substitutions, but adist() will return a smaller distance because a simpler transformation exists (delete 'M', then insert 'e' at the end).

adist("Moos", "oose")

The edit distance from one string to another is not always the sum of the edit distances between them and an intermediate string, but it is never the more than the sum. In other words the triangle inequality holds.

### These:
adist("triangle inequality?","what's that?")
adist("what's that?","some math thing.")
### 15 and 12

### Do not add to this:
adist("triangle inequality?","some math thing.")
### 17

By default each type of primary edit, insertions, deletions, and substitutions, contributes and edit distance of 1. This can be changed with an option in adist(). As above, the smallest distance is used, so watch your weights.

adist("insert3","in---sert3", costs=list(insertion=1)) # 3
adist("insert3","in---sert3", costs=list(insertion=50)) # 150
adist("Mooo!","Moo", costs=list(insertion=0, deletion=50)) # 100

Note the case sensitivity. Since to a computer an upper case 'R' and a lower case 'r' are two completely different letters, differences in cases count towards the edit distance.

str1 = "youch that hurt!"
str2 = "YOUCH THAT HURT!"
adist(str1,str2) ## edit distance 13

If you don't want edit distance to count cases, you can either use functions to change the case of the strings, or use an option in adist()

adist( toupper(str1), str2)
adist( str1, tolower(str2))
adist( str1, str2, ignore.case=TRUE) # all zero distance

Leading and trailing spaces can also be an issue because they contribute to distance, but you can remove them with the str_trim() function in the stringr package.

str1 = "Mars"
str2 = "                 Mars              "
adist(str1,str2) # distance is vast
adist(str1, str_trim(str2)) # distance is nothing

Transpositions, which are the swapping of the characters or words, are more problematic. These are interpreted as large numbers of substitutions or insertions/deletions, usually.

adist("Animal", "Aminal") # 2
adist("Down Under", "Under Down") # 10
adist("Life isn't fair.","fair Life isn't.") # 10

Edit distance is used for fuzzy matching.

Fuzzy matching, in contrast to the exact matching that we dealt with in the last text processing lesson, is designed to detect strings that are close to a target string (or string pattern) as well as ones that fit the string (pattern) completely. Common systems that use fuzzy matching (and other things) include spell checkers and search engines when suggestions are made that are close to what you wrote. Some of the suggestions that come up will be selected because they have a minimal edit distance subject to some weighting based on common misspellings or typos.

We can build a workable fuzzy matching function in R in less than ten lines.

Suppose we had the transcript of the events of a Major League Soccer match from ESPN, like this one between the Houston Dynamo and the New England Revolution ( ESPN_MLS_Example.txt , thanks to Rajitha Silva for letting me give away part of his research like this ), and we wanted to identify the team associated with certain lines, but there was a lot of extra and messy information in the way. This includes lines from the example text like:

Houston Dynamo Dominic Oduro 58'
New England Revolution • Shalrie Joseph 43' • Kheli Dube 73'
Joseph Ngwenya (Houston Dynamo) receives a red card.
Houston Dynamo makes a substitution: Joseph Ngwenya enters for Cameron Weaver
Goal !!! Kheli Dube scores for New England Revolution

In each case, you can identify the team of interest by looking at the edit distance between the team name and each of the lines. Most of time (alas, there are always edge cases), the team with the shorter edit distance to the line of text is the team of interest.

adist("Houston Dynamo Dominic Oduro 58'","Houston Dynamo")
18

adist("Houston Dynamo Dominic Oduro 58'","New England Revolution")
27

Using this principle, we can make a simple 'choose the least' system.

teamlist = c("Houston Dynamo", "New England Revolution", 
 "Chicago Fire", "Montreal Impact", "Vancouver Whitecaps")

clean_teamname = function(raw_name, teamlist)
{
    #optional
    require(stringr)
    raw_name = str_trim(raw_name)
    
    #necessary
    ed_dist = adist(raw_name, teamlist)
    best = which(ed_dist == min(ed_dist))[1] # [1] breaks ties
    return(teamlist[best]) # Return the best fitting name
}

Then we test it out:

clean_teamname("Houston Dynamo Dominic Oduro 58' ",teamlist)
[1] "Houston Dynamo"

clean_teamname(

"Houston Dynamo makes a substitution: Joseph Ngwenya enters for Cameron Weaver",

teamlist)
[1] "Houston Dynamo"

clean_teamname(

"Joseph Ngwenya (Houston Dynamo) receives a red card. ",teamlist)
[1] "Houston Dynamo"

clean_teamname("Goal !!! Kheli Dube scores for New England Revolution",teamlist)
[1] "New England Revolution"

Chica is a little skeptical about that last yellow card.

Statistics et al.

Featured post

Textbook: Writing for Statistics and Data Science

Tuesday, 9 April 2019

Natural Language Processing in R: Edit Distance

No comments:

Post a Comment