<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:series="http://unfoldingneurons.com/"
	>

<channel>
	<title>MettaProgramming &#187; parsing</title>
	<atom:link href="http://mettadore.com/tag/parsing/feed/" rel="self" type="application/rss+xml" />
	<link>http://mettadore.com</link>
	<description>Thoughts on Software and Technology</description>
	<lastBuildDate>Mon, 09 Apr 2012 19:11:59 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
<xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
		<item>
		<title>Fun With Regex: Trying To Approach Natural Language Parsing</title>
		<link>http://mettadore.com/ruby/fun-with-regex-trying-to-approach-natural-language-parsing/</link>
		<comments>http://mettadore.com/ruby/fun-with-regex-trying-to-approach-natural-language-parsing/#comments</comments>
		<pubDate>Tue, 09 Feb 2010 20:30:23 +0000</pubDate>
		<dc:creator>john</dc:creator>
				<category><![CDATA[Ruby]]></category>
		<category><![CDATA[natural language]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[regex]]></category>

		<guid isPermaLink="false">http://mettadore.com/?p=284</guid>
		<description><![CDATA[I hate Regular Expressions. I don&#8217;t hate them because they&#8217;re broken, or stupid, or don&#8217;t work. Rather, I hate them because I&#8217;m one of those few people who started programming back in the days of the Commodore Vic-20 and who still has not taken the time to learn them. Actually, I don&#8217;t hate them at [...]]]></description>
			<content:encoded><![CDATA[<p>I hate Regular Expressions.</p>
<p>I don&#8217;t hate them because they&#8217;re broken, or stupid, or don&#8217;t work. Rather, I hate them because I&#8217;m one of those few people who started programming back in the days of the Commodore Vic-20 and who <em>still</em> has not taken the time to learn them. Actually, I don&#8217;t <em>hate</em> them at all, I simply <em>fear</em> them.</p>
<p>It&#8217;s like talking to a 50-year-old who says &#8220;I always meant to learn how to swim, I just never got around to it.&#8221; That&#8217;s the way I am with Regular Expressions. I always meant to learn them… which basically translates to &#8220;I always find a way to shy away from them.&#8221; It&#8217;s a ridiculous state, and one that I very rarely find myself in.</p>
<p>Well, I&#8217;ve started diving deep lately for a side Ruby project and am pretty happy with the result. I thought I&#8217;d go through the development of what is to me a slightly complex regular expression construct, more for my own memory and development than to teach anything which I don&#8217;t know enough about to teach.<span id="more-284"></span></p>
<h3>First Try</h3>
<p>The idea of Regular Expressions is to parse a string for a pattern. The need I have involves parsing a string for a set of specific instructions resulting in points being added to, or taken from, a given user. For instance, I need to parse strings similar to &#8220;@Scorekeeper give 4 points to @USER_A&#8221; and &#8220;@Scorekeeper give @USER_B 3 points.&#8221; I also need to parse negative actions, so &#8220;@Scorekeeper take 3 points from @USER_A&#8221; is a valid expression as well.</p>
<p>So, let&#8217;s create a simple string to parse that has a few of those elements:</p>
<pre class="brush: ruby; title: ; notranslate">

text = &quot;@Scorekeeper give 4 points to @USER_A for doing something really cool. Then give @USER_B 5 points for working hard. Then take 7 from @USER_C for being a jerk&quot;
</pre>
<p>The relevant parts we need to get from this are &#8220;USER_A gets 4 points,&#8221; &#8220;USER_B gets 5 points,&#8221; and &#8220;USER_C loses 7 points.&#8221; My original strategy was to search using two simple patterns in succession, basically using something like this:</p>
<pre class="brush: ruby; title: ; notranslate">

…

matches = text.scan(/[+-]?[0-9]{1,2} [^@]*[@]+[A-Za-z0-9_-]+/)

matches += text.scan(/[@]+[A-Za-z0-9_-]+ [+-]?[0-9]{1,2}/)

…
</pre>
<p>Okay, I said that I wasn&#8217;t good with regular expressions, right? The first expression was supposed to match an optional plus/minus sign followed immediately by 1-2 digits and then ignore everything until getting to an &#8220;@&#8221; symbol, matching everything after that in the word. The second was supposed to match basically that same pattern, but backwards (name, then numbers).</p>
<p>Ignoring the ugly patterns, I&#8217;ll skip right to the part that made me finally do some digging. RegEx gurus will already see that this pattern gives 4 points to USER_A, but then also gives 4 to USER_B, then <em>gives</em> 4 points to USER_C. The problem is that it matches on 4-&gt;user and does this all the way through, giving us three matches.</p>
<h3>Help me StackOverflow!</h3>
<p>After struggling with whether to create ridiculously long expressions, and even whether to create a DSL or an interpreter, I ended up learning that RegEx is neither as scary nor as difficult as I thought. Consider the following take on the pattern given to me by <a href="http://stackoverflow.com/questions/2224121/building-a-semi-natural-language-dsl-in-ruby">a friendly Stack Overflow user</a>:</p>
<pre class="brush: ruby; title: ; notranslate">

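input   = &quot;@Scorekeeper give 4 points to @USER_A&quot;
# The trailing \b lets the lazy (.+?) run to the end of the username
PATTERN = /give ([0-9]+) points to @(.+?)\b/
input =~ PATTERN
points  = $~[1] # =&gt; &quot;4&quot;
user    = $~[2] # =&gt; &quot;USER_A&quot;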
</pre>
<p>Not only does this simplify the username match, it puts the match in syntactic context. Rather than matching inappropriately, it now only matches &#8220;give X points to USER.&#8221; Of course, this doesn&#8217;t match the other users, because the sentence isn&#8217;t exactly the same. What I needed to do was come up with something to allow both giving and taking, in multiple formats. This pattern is closer:</p>
<pre class="brush: ruby; title: ; notranslate">

PATTERN2 = /\b(give|take) ([+-]?[0-9])(?: points)? \b(to|from) @(.+?)\b/i

arr = text.scan(PATTERN2)

=&gt; [[&quot;give&quot;, &quot;4&quot;, &quot;to&quot;, &quot;USER_A&quot;], [&quot;take&quot;, &quot;7&quot;, &quot;from&quot;, &quot;USER_C&quot;]]
</pre>
<p>Here, I&#8217;m saying match <em>either</em> &#8220;give&#8221; <em>or</em> &#8220;take,&#8221; then match a number (optionally preceded by a plus or minus sign and optionally followed by the word &#8220;points&#8221;), followed by either &#8220;to&#8221; or &#8220;from&#8221; and then a name. Additionally, make the match case insensitive (the trailing &#8220;i&#8221;) in case someone types &#8220;Give&#8221; or even &#8220;taKe.&#8221; Each match is captured as its own array of these elements, so the output is easy to work with.</p>
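<p>To make the shape of that output concrete, here is a minimal sketch of how scan behaves with and without capture groups. The sample sentence, the simplified pattern, and the variable names are assumptions for illustration, not the code from the application.</p>

```ruby
# Hypothetical sample input; the pattern is a simplified stand-in for PATTERN2.
sample = "give 4 to @USER_A and take 7 from @USER_C"

# Without capture groups, scan returns the matched substrings.
mentions = sample.scan(/@\w+/)

# With capture groups, scan returns one array of captures per match.
commands = sample.scan(/\b(give|take) ([+-]?[0-9]) (to|from) @(\w+)/i)

# Each inner array can be destructured in the block parameters.
deltas = commands.map do |verb, amount, _preposition, user|
  sign = verb.downcase == "take" ? -1 : 1
  [user, sign * amount.to_i]
end

p mentions  # => ["@USER_A", "@USER_C"]
p commands  # => [["give", "4", "to", "USER_A"], ["take", "7", "from", "USER_C"]]
p deltas    # => [["USER_A", 4], ["USER_C", -7]]
```

<p>Turning &#8220;take&#8221; into a negative sign at this stage is one possible design choice; it keeps the regular expression itself free of any arithmetic.</p>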
<h3>Making It More Robust</h3>
<p>The problem here is that someone might accidentally type two spaces. Or a random letter. For instance, what if someone types &#8220;give a -3 to @USER_A&#8221;? That pattern won&#8217;t match. It only matches on &#8220;give&lt;space&gt;number.&#8221; It doesn&#8217;t even match on &#8220;give&lt;space&gt;&lt;space&gt;number.&#8221; For these problems, I decided to add the <span style="font-family: andale mono,times">\s</span> shorthand, which matches any whitespace character, and the <span style="font-family: andale mono,times">\w</span> shorthand, which matches any word character (a letter, digit, or underscore). What I came up with was <span style="font-family: andale mono,times">[\s\w]*</span> which means &#8220;match any whitespace or word character, zero or more times.&#8221;</p>
<p>That small addition gives us this:</p>
<pre class="brush: ruby; title: ; notranslate">

SEARCH_STRING = &quot;@Scorekeeper give a healthy 4 to the great @USER_A for doing something really cool. Then give @USER_B a whopping 5 points for working on this. Then take 7 points from @USER_C.&quot;

PATTERN3 = /\b(give|take)[\s\w]*([+-]?[0-9])[\s\w]*\b(to|from)[\s\w]*@(.+?)\b/i

&gt;&gt; arr = SEARCH_STRING.scan(PATTERN3)
=&gt; [[&quot;give&quot;, &quot;4&quot;, &quot;to&quot;, &quot;USER_A&quot;], [&quot;take&quot;, &quot;7&quot;, &quot;from&quot;, &quot;USER_C&quot;]]
</pre>
<p>As you can see, now we have the ability to add adjectives to our phrase, getting even closer to natural language&#8211; or, at least language as natural as we need for this one application. But what if someone accidentally writes &#8220;take <em>seven</em> points from @USER_C?&#8221; Then we&#8217;re sunk. Luckily, for this application, we&#8217;re only allowing up to 10 points per score, so we can easily add those words with the OR pipe and process it on the backend:</p>
<pre class="brush: ruby; title: ; notranslate">

PATTERN3 = /\b(give|take)[\s\w]*([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)[\s\w]*\b(to|from)[\s\w]*@(.+?)\b/i
</pre>
<p>This captures a number&#8211; optionally preceded by a plus or minus sign&#8211; or any of the words listed. I&#8217;ve decided not to catch the word &#8220;minus&#8221; and count on someone writing &#8220;take X from&#8221; rather than &#8220;give minus X to.&#8221;</p>
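<p>The &#8220;process it on the backend&#8221; step could look something like the following. This is only a sketch: the NUMBER_WORDS table and the to_points helper are hypothetical names introduced here, not part of the original application.</p>

```ruby
# Hypothetical lookup table for the number words the pattern accepts.
NUMBER_WORDS = %w[one two three four five six seven eight nine ten]
  .each_with_index.map { |word, index| [word, index + 1] }.to_h.freeze

# Convert a captured amount ("4", "+4", "-3", "seven", "Seven") to an Integer.
# Words are looked up case-insensitively; anything else is parsed as a
# base-10 integer, so an unexpected capture raises instead of becoming 0.
def to_points(captured)
  NUMBER_WORDS.fetch(captured.downcase) { Integer(captured, 10) }
end

p to_points("4")      # => 4
p to_points("-3")     # => -3
p to_points("seven")  # => 7
```

<p>Keeping the conversion outside the regular expression means the pattern only has to recognize an amount, not interpret it.</p>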
<h3>The Second Pattern</h3>
<p>This still misses USER_B. Well, I couldn&#8217;t come up with a single expression that matched on both, so I created another one:</p>
<pre class="brush: ruby; title: ; notranslate">

PATTERN4 = /\bgive[\s]*@(.+?)\b[\s\w]*([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)/i
</pre>
<p>Which is a bit of a &#8220;swapped around subset&#8221; of the former match. This will only match &#8220;Give&lt;whitespaces&gt;USER&lt;anything&gt;X points.&#8221; My hope was that I could use the <span style="font-family: andale mono,times">[\s\w]*</span> pattern after &#8220;give&#8221; so that we could write &#8220;Give the great @USER_B ten points.&#8221; Unfortunately, the <span style="font-family: andale mono,times">\w</span> match becomes greedy and matches everything from the first &#8220;give&#8221; through to USER_B. Thus, I&#8217;d have to limit that syntax to whitespace only. In order to make that work, I have to say &#8220;any whitespace or any word, but not &#8216;@&#8217;.&#8221;</p>
<p>I tried everything I could think of with no success, so finally I asked <a href="http://twitter.com/wajiii">a friend</a> who&#8217;s a regex guru and he pointed out that the pattern <span style="font-family: andale mono,times">(.+?)</span> was matching everything and slurping it all into the username capture. Limiting the username match to the original list of legal username characters was safer. Thus, I ended up changing both username captures to legal character ranges.</p>
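<p>A small experiment shows the kind of slurping described above. The input string and both patterns are simplified assumptions made up for this illustration (single digits only, no number words); they are not the exact patterns from the application.</p>

```ruby
# Hypothetical input with two commands; only the second one carries a number.
sample = "give @USER_A. Then give @USER_B 5 points"

# "Anything" capture: the lazy .+? keeps expanding past the username
# until the rest of the pattern is able to match.
loose  = /\bgive\s*@(.+?)\b[\s\w]*([0-9])/i

# Username restricted to letters, digits, and underscores.
strict = /\bgive\s*@([A-Za-z0-9_]+)\b[\s\w]*([0-9])/i

loose_matches  = sample.scan(loose)
strict_matches = sample.scan(strict)

p loose_matches.first.first  # a "username" that runs on well past USER_A
p strict_matches             # => [["USER_B", "5"]]
```

<p>With the restricted character class, the first command simply fails to match because it has no number, instead of borrowing one from the next sentence.</p>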
<h3>Coda</h3>
<p>The final result is fairly robust. It matches on the quick test string where I want it, and doesn&#8217;t match where I don&#8217;t:</p>
<pre class="brush: ruby; title: ; notranslate">
SEARCH_STRING = &quot;@Scorekeeper give a healthy 4 to the great @USER_A for doing something really cool. Then give the friendly @USER_B a healthy five points for working on this. Then take seven points from the jerk @USER_C.&quot;

PATTERN_A = /\b(give|take)[\s\w]*([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)[\s\w]*\b(to|from)[\s\w]*@([a-zA-Z0-9_]*)\b/i

# The lazy [\s\w]*? before the amount captures the first number after the username
PATTERN_B = /\bgive[\s\w]*@([a-zA-Z0-9_]*)\b[\s\w]*?([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)/i

SEARCH_STRING.scan(PATTERN_A) # =&gt; [[&quot;give&quot;, &quot;4&quot;, &quot;to&quot;, &quot;USER_A&quot;],
                              #     [&quot;take&quot;, &quot;seven&quot;, &quot;from&quot;, &quot;USER_C&quot;]]
SEARCH_STRING.scan(PATTERN_B) # =&gt; [[&quot;USER_B&quot;, &quot;five&quot;]]
</pre>
<p>The final version searches the text using these two patterns and uses the captured match arrays to process the commands. It allows fun syntax such as &#8220;give that wonderful @wajiii ten points for all his help and take 5 points from the creep @lawduck for making me develop this application in the first place.&#8221;</p>
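<p>As a rough sketch of what &#8220;uses the captured match arrays to process the commands&#8221; might look like, the following tallies a net score change per user. The WORDS list and the to_points and tally helpers are hypothetical names, and the patterns are written along the lines of PATTERN_A and PATTERN_B; in particular, the lazy quantifier before the amount in PATTERN_B is an assumption of this sketch, chosen so that the first number after the username is the one captured.</p>

```ruby
# Number words the patterns accept, in order, so index + 1 is the value.
WORDS = %w[one two three four five six seven eight nine ten].freeze

# Patterns along the lines of PATTERN_A and PATTERN_B. The lazy [\s\w]*?
# before the amount in PATTERN_B is an assumption of this sketch.
PATTERN_A = /\b(give|take)[\s\w]*([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)[\s\w]*\b(to|from)[\s\w]*@([a-zA-Z0-9_]*)\b/i
PATTERN_B = /\bgive[\s\w]*@([a-zA-Z0-9_]*)\b[\s\w]*?([+-]?[0-9]|one|two|three|four|five|six|seven|eight|nine|ten)/i

# Convert a captured amount such as "5", "-3", or "ten" to an Integer.
def to_points(amount)
  index = WORDS.index(amount.downcase)
  index ? index + 1 : amount.to_i
end

# Hypothetical helper: returns a Hash of username => net point change.
def tally(text)
  scores = Hash.new(0)
  # "give N to @user" and "take N from @user"
  text.scan(PATTERN_A) do |verb, amount, _preposition, user|
    sign = verb.casecmp("take").zero? ? -1 : 1
    scores[user] += sign * to_points(amount)
  end
  # "give @user N"
  text.scan(PATTERN_B) do |user, amount|
    scores[user] += to_points(amount)
  end
  scores
end

sentence = "give that wonderful @wajiii ten points for all his help " \
           "and take 5 points from the creep @lawduck"
p tally(sentence)
```

<p>Running the two scans separately mirrors the two-pattern approach above; a phrase that happened to satisfy both patterns would be counted twice, which a real implementation would probably want to guard against.</p>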
<p>It also gets me as close as I need to be in this application to natural language parsing. Although now there&#8217;s no reason for me to avoid learning about lookahead parsing.</p>
<h4>[Update]</h4>
<p>By the way, a shout out to <a href="http://rubular.com/">Rubular</a>, a great online Ruby RegEx syntax tester which was wonderfully helpful during this exercise!</p>
]]></content:encoded>
			<wfw:commentRss>http://mettadore.com/ruby/fun-with-regex-trying-to-approach-natural-language-parsing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Responsibility of Web App Programmers</title>
		<link>http://mettadore.com/analysis/the-responsibility-of-web-app-programmers/</link>
		<comments>http://mettadore.com/analysis/the-responsibility-of-web-app-programmers/#comments</comments>
		<pubDate>Fri, 05 Feb 2010 18:46:16 +0000</pubDate>
		<dc:creator>john</dc:creator>
				<category><![CDATA[Miscellany]]></category>
		<category><![CDATA[parsing]]></category>
		<category><![CDATA[stream]]></category>
		<category><![CDATA[Tagnic]]></category>

		<guid isPermaLink="false">http://mettadore.com/?p=271</guid>
		<description><![CDATA[I had an interesting experience yesterday that illustrated to me the danger of building Twitter-based web applications without seriously considering the consequences of what you are doing. It was a perfect low-consequence wake-up call on the incredible responsibility web application programmers have. It was basically a case of my Twitter stream being SPAMmed by an [...]]]></description>
			<content:encoded><![CDATA[<p>I had an interesting experience yesterday that illustrated to me the danger of building Twitter-based web applications without seriously considering the consequences of what you are doing. It was a perfect low-consequence wake-up call on the incredible responsibility web application programmers have.</p>
<p>It was basically a case of my Twitter stream being SPAMmed by an application that I didn&#8217;t know existed and don&#8217;t really care about. This is pretty important to me since I&#8217;m developing the &#8220;for fun&#8221; side project <a href="http://mettadore.com/analysis/featured/tweetscore-data-analysis-for-fun/">Tweetscore</a>, which might have some day done the same thing.<span id="more-271"></span></p>
<p><a href="http://mettadore.com/files/2010/02/Screen-shot-2010-02-05-at-10.09.22-AM.png"><img class="alignleft size-medium wp-image-272" src="http://mettadore.com/files/2010/02/Screen-shot-2010-02-05-at-10.09.22-AM-300x123.png" alt="" width="300" height="123" /></a>Tweetscore is an experiment in contextual ranking systems<sup><a href="http://mettadore.com/analysis/the-responsibility-of-web-app-programmers/#footnote_0_271" id="identifier_0_271" class="footnote-link footnote-identifier-link" title="Some people call it social currency, but I think that&amp;#8217;s taking things a bit too seriously. The anthropologist in me wants to see the word &amp;#8220;currency&amp;#8221; tied to something much more culturally substantial">1</a></sup> designed so that Twitter users can send each other points in their stream.</p>
<p>Yesterday, I met with another developer who is creating a Facebook application and who is interested in combining his app with Tweetscore in some way. After our meeting, we both made posts on Twitter giving each other points for fun (because, well, we were just talking about that). Everything seemed peachy.</p>
<p><a href="http://mettadore.com/files/2010/02/Screen-shot-2010-02-05-at-10.11.56-AM.png"><img class="alignright size-medium wp-image-273" src="http://mettadore.com/files/2010/02/Screen-shot-2010-02-05-at-10.11.56-AM-300x157.png" alt="" width="300" height="157" /></a>Until about 5 minutes later, when I got a message from the @playtagnic account that stated that my friend had given me points on <a href="http://playtagnic.com">PlayTagnic</a>.</p>
<p>What?</p>
<p>Heading over to the site, it turns out that the PlayTagnic application, which was following my friend&#8217;s Twitter stream, parsed out his comment about giving me points, and pulled my Twitter account to create a profile on the Tagnic website so that it could list those points under it.</p>
<p>My first thought: &#8220;What? No one was talking to <em>you</em>!&#8221;</p>
<p><a href="http://mettadore.com/files/2010/02/Screen-shot-2010-02-05-at-10.16.17-AM.png"><img class="alignleft size-medium wp-image-274" src="http://mettadore.com/files/2010/02/Screen-shot-2010-02-05-at-10.16.17-AM-300x158.png" alt="" width="300" height="158" /></a>But wait, there&#8217;s more.</p>
<p>Because my friend used Tweetscore&#8217;s &#8220;conversational aside&#8221; post format (@tweetscore, give @somebody X points for something), the Tagnic web application&#8217;s parsing strategy duplicated the points and did the same thing to the Tweetscore account profile.</p>
<p>So, now, instead of just one user suddenly having a profile that was generated completely out of context, there are <em>two</em> users that do.</p>
<p>My biggest worry is that now this rogue PlayTagnic application is going to parse my Twitter stream, and start SPAMming other people with messages that I never intended it to send, or even see!</p>
<h3>The Responsibility of Web Programmers</h3>
<p>This is one of the main problems with easy web application frameworks, and with programming for a platform such as Twitter.</p>
<p>In a word: Context.</p>
<p>There is more information than you can imagine flowing through that firehose, and people are using the information in ways that are often vastly different from how you use it, or have imagined. Thus, any sloppiness in your design strategy is going to be magnified… firehose-style.</p>
<p>PlayTagnic is an example, and an ironically timed one. The entire purpose of Tweetscore is to experiment with the possibility of keeping a Twitter application&#8217;s involvement completely contextual. I&#8217;ve worked hard in early development to make sure that it doesn&#8217;t send messages to people who are not interested in its existence. So it&#8217;s pretty ironic when the use of Tweetscore <em>itself</em> causes <em>another</em> application to use the information in a way that&#8217;s completely out of context.</p>
<p>My friend wasn&#8217;t playing their game, he never intended to introduce <em>me</em> to their game, yet there it was, forcing its way into our conversation and causing a conceptual dissonance by sending us random messages out of context&#8211; out of left field, actually.</p>
<p>Current web frameworks make programming easy. Sometimes I wonder if it&#8217;s <em>too</em> easy. You can create and deploy a web application in just a few hours that has the ability to affect literally millions of people.</p>
<p>Talk about needing to acknowledge your own responsibility!</p>
<h3>Coda</h3>
<p>We have a serious responsibility as programmers, more so now than ever. Programming for a public application is hard, and it&#8217;s made harder by informationally dense firehoses like Twitter. We shouldn&#8217;t make things even harder, on ourselves, on each other, and on the innocent bystanders, by causing problems that are orthogonal to our own sphere of concern.</p>
<p>This was a lesson to me. A lesson that I need to be careful. I don&#8217;t want to create an application that does the same thing as PlayTagnic, and force its way into people&#8217;s streams as the equivalent of Twitter SPAM. I&#8217;ll need to think real hard now about Tweetscore&#8217;s parsing, and maybe ensure that I only parse conversations that I know are in context.</p>
<p>I have a responsibility as a programmer. I hope I take that seriously enough.</p>
<ol class="footnotes"><li id="footnote_0_271" class="footnote">Some people call it social currency, but I think that&#8217;s taking things a bit too seriously. The anthropologist in me wants to see the word &#8220;currency&#8221; tied to something much more culturally substantial</li></ol>]]></content:encoded>
			<wfw:commentRss>http://mettadore.com/analysis/the-responsibility-of-web-app-programmers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

