MettaProgramming

Thoughts on Software and Technology

Fun With Regex: Trying To Approach Natural Language Parsing

leave a comment

I hate Regular Expressions.

I don’t hate them because they’re broken, or stupid, or don’t work. Rather, I hate them because I’m one of those few people who started programming back in the days of the Commodore Vic-20 and who still has not taken the time to learn them. Actually, I don’t hate them at all, I simply fear them.

It’s like talking to a 50 year old who says “I always meant to learn how to swim, I just never got around to it.” That’s the way I am with Regular Expressions. I always meant to learn them… which basically translates to “I always find a way to shy away from them.” It’s a ridiculous state, and one that I very rarely find myself in.

Well, I’ve started diving deep lately for a side Ruby project and am pretty happy with the result. I thought I’d go through the development of what is to me a slightly complex regular expression construct, more for my own memory and development than to teach anything which I don’t know enough about to teach.

First Try

The idea of Regular Expressions is to parse a string for a pattern. The need I have involves parsing a string for a set of specific instructions resulting in points being added to, or taken from, a given user. For instance, I need to parse a string’s similar to “@Scorekeeper give 4 points to @USER_A” and “@Scorekeeper give @USER_B 3 points.” I also need to parse negative actions, so “@Scorekeeper take 3 points from @USER_A” is a valid expression as well.

So, let’s generate a couple simple strings to parse that has a few of those elements:


text1 = "@Scorekeeper give 4 points to @USER_A for doing something really cool. Then give @USER_B 5 points for working hard. Then take 7 from @USER_C for being a jerk"

The relevant parts we need to get from this are “USER_A gets 4 points,” “USER_B gets 5 points,” and “USER_C looses 7 points.” My original strategy was to search using the two simple patterns in succession, basically using something like this:


…

matches = text.scan(/[+-]?[0-9]{1,2} [^@]*[@]+[A-Za-z0-9-_]+/)

matches += text.scan(/[@]+[A-Za-z0-9-_]+ [+-]?[0-9]{1,2}/)

…

Okay, I said that I wasn’t good with regular expressions, right? The first expression was supposed to match an optional plus/minus sign followed immediately by 1-2 digits and then ignore everything until getting to an “@” symbol, matching everything after that in the word. The second was supposed to match basically that same pattern, but backwards (name, then numbers).

Ignoring the ugly patterns, I’ll skip right to the part that made me finally do some digging. RegEx gurus will already see that this pattern gives 4 points to USER_A, but then also gives 4 to USER_B, then gives 4 points to USER_C. The problem is that it matches on 4->user and does this all the way through, giving us three matches.

Help me StackOverflow!

After struggling with with whether to create ridiculously long expressions, and even whether to create a DSL or an interpreter, I ended up learning that RegEx is neither as scary nor as difficult as I thought. Consider the following take on the pattern given to me by a friendly Stack Overflow user:


PATTERN = /give ([0-9]+) points to @(.+?)/
input =~ PATTERN
points  = $~[1] # => "4"
user    = $~[2] # => "USER_A"

Not only does this simplify username match, it puts the match in syntactic context. Rather than matching inappropriately, it now only matches “give X points to USER.” Of course, this doesn’t match the other users, because the sentence isn’t exactly the same. What I needed to do was come up with something to allow both giving and taking, in multiple formats. this pattern is closer:


PATTERN2 = /b(give|take) ([+-]?[0-9]) b(to|from) @(.+?)/i

arr = text.scan(PATTERN2)

=> [["give", "4", "to", "USER_A"], ["take", "7", "from", "USER_C"]]

Here, I’m saying match either “give” or “take,” then match a number (optionally preceeded by a plus or minus sign), followed by either “to” or “from” and then a name. Additionally, make the match case insensitive (the trailing “i”) in case someone types “Give” or even “taKe.” The match captures each of these elements into separate arrays, so the output is easy to work with.

Making It More Robust

The problem here is that someone might accidentally type two spaces. Or a random letter. For instance, what if someone types “give a -3 to @USER_A?” that pattern won’t match. It only matches on “give<space>number.” It doesn’t even match on “give<space><space>number.” For these problems, I decided to add the “s” switch which matches on any whitespace, and the “w” switch, which matches on any word. What I came up with was [sw]* which is syntactically equivalent to “match any whitespace or any word zero or more times.”

That small addition gives us this:


SEARCH_STRING = "@Scorekeeper give a healthy 4 to the great @USER_A for doing something really cool.Then give @USER_B a whooping 5 points for working on this. Then take 7 points from @USER_C."

PATTERN3 = /b(give|take)[sw]*([+-]?[0-9])[sw]*b(to|from)[sw]*@(.+?)b/i

>> arr = SEARCH_STRING.scan(PATTERN_1)
=> [["give", "4", "to", "USER_A"], ["take", "7", "from", "USER_C"]]

As you can see, now we have the ability to add adjectives to our phrase, getting even closer to natural language– or, at least language as natural as we need for this one application. But what if someone accidentally writes “take seven points from @USER_C?” Then we’re sunk. Luckily, for this application, we’re only allowing up to 10 points per score, so we can easily add those words with the OR pipe and process it on the backend:


PATTERN3 = /b(give|take)[sw]*([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)[sw]*b(to|from)[sw]*@(.+?)b/i

This captures a number– optionally preceded by a plus or minus sign– or any of the words listed. I’ve decided not to catch the word “minus” and count on someone writing “take X from” rather than “give minus X to.”

The Second Pattern

This still misses USER_B. Well, I couldn’t come up with a single expression that matched on both, so I created another one:


PATTERN4 = /bgive[s]*@(.+?)b[sw]*([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)/i

Which is a bit of a “swapped around subset” of the former match. This will only match “Give<whitespaces>USER<anything>X points.” My hope was that I could use the [s|w]* pattern after “give” so that we could write “Give the great @USER_B ten points.” Unfortunately, the “w” switch becomes greedy and matches everything from the first “give” through to USER_B. Thus, I’d have to limit that syntax to whitespace only. In order to make that work, I have to say “any whitespace or any word, but not ‘@’.”

I tried everything I could think of with no success, so finally I asked a friend who’s a regex guru and he pointed out that the pattern (.+?) was matching everything and slurping it all into the username capture. Limiting the username match to the original list of legal character names was safer. Thus, I ended up changing both username captures to legal character ranges.

Coda

The final result is fairly robust. It matches on the quick test string where I want it, and doesn’t match where I don’t:

SEARCH_STRING = "@Scorekeeper give a healthy 4 to the great @USER_A for doing something really cool.Then give the friendly @USER_B a healthy five points for working on this. Then take seven points from the jerk @USER_C."

PATTERN_A = /b(give|take)[sw]*([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)[sw]*b(to|from)[sw]*@([a-zA-Z0-9_]*)b/i

PATTERN_B = /bgive[sw]*@([a-zA-Z0-9_]*)b[sw]*([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)/i

SEARCH_STRING.scan(PATTERN_A) # => [["give", "4", "to", "USER_A"],
                              #     ["take", "seven", "from", "USER_C"]]
SEARCH_STRING.scan(PATTERN_B) # => [["USER_B", "five"]]

The final version searches the text using these two patterns and uses the captured match arrays to process the commands. It allows fun syntax such as “give that wonderful @wajiii ten points for all his help and take 5 points from the creep @lawduck for making me develop this application in the first place.”

It also gets me as close as I need to be in this application to natural language parsing. Although now there’s no reason for me to avoid learning about look ahead parsing now.

[Update]

By the way, a shout out to Rubular, a great online Ruby RegEx syntax tester which was wonderfully helpful during this exercise!

Written by john

February 9th, 2010 at 12:30 pm

Posted in Ruby

Tagged with , ,