Soup is a port of the Python Beautiful Soup HTML parser.
Soup can be used to parse RSS.
The above code produces the following output:
ROE with NBSQLite3 Enduring Scaling Web Applications on PostgreSQL NBSQLite3 - NativeBoost SQLite for Pharo OSX NativeBoost FFI Oddity
Tested on Pharo versions 2.0 and 3.0 beta.
Periodically, the Stack Exchange people publish a dump of the content of all public Stack Exchange sites. I played with it back in 2009 when this started, but have lost what little code I wrote back then.
I just downloaded the Sep 2011 dump. For StackOverflow alone, here are the file sizes:
Assuming each row is a line by itself, there were more than six million posts as of Sep 2011:
According to readme.txt in the dump package, the file posts.xml has the following schema:
- 1: Question
- 2: Answer
- ParentID (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- LastEditorDisplayName="Jeff Atwood"
I'm not going to build a DOM tree of 6+ million posts in RAM yet, so I'll use a SAX handler to parse the thing. First, install XMLSupport:
As per SAXHandler's class comment, subclass it and override handlers under the "content" and "lexical" categories as needed:
For a schema as simple as the above, the method of interest is this:
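A minimal sketch of such an override, assuming a hypothetical SAXHandler subclass named SOPostHandler that just logs each element and its attributes to the Transcript (the class name and the logging are illustrative, not the original listing):

```smalltalk
SOPostHandler >> startElement: aQualifiedName attributes: aDictionary
    "Called once per opening tag. aDictionary maps attribute names to
     their string values. SOPostHandler is a hypothetical subclass of
     SAXHandler used only for this sketch."
    Transcript show: aQualifiedName; cr.
    aDictionary keysAndValuesDo: [ :name :value |
        Transcript tab; show: name; show: ' = '; show: value; cr ]
```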
Using a 1-row test set, the following do-it
produces this output:
From here on, it is straightforward to flesh out startElement:attributes: to extract the stuff that is interesting to me.
To count the actual number of records, just keep a running count as each post is parsed, and print that number in the method endDocument. The run took a long time (by the wall clock) and counted 6,479,788 posts, the same number as produced by egrep'ping rowId.
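The counting approach can be sketched roughly as follows. This is an illustration under assumptions, not the code that produced the number above: the class SOPostCounter and its category are hypothetical, each post is assumed to be one element named 'row', and the do-it at the end assumes the handler is created with on: and started with parseDocument, as in the XMLSupport version used here.

```smalltalk
"Hypothetical subclass; the class and category names are illustrative."
SAXHandler subclass: #SOPostCounter
    instanceVariableNames: 'count'
    classVariableNames: ''
    poolDictionaries: ''
    category: 'SODump-Sketch'

SOPostCounter >> startElement: aQualifiedName attributes: aDictionary
    "Assumption: every post is a single element named 'row'.
     count starts out nil, so treat nil as zero."
    aQualifiedName = 'row'
        ifTrue: [ count := (count ifNil: [ 0 ]) + 1 ]

SOPostCounter >> endDocument
    "Print the running total once the whole file has been read."
    Transcript show: (count ifNil: [ 0 ]) printString; cr

"Do-it (assumed entry points: on: and parseDocument):"
(SOPostCounter on: (FileStream readOnlyFileNamed: 'posts.xml')) parseDocument
```

Because the handler keeps only an integer, memory use stays flat no matter how many rows the file contains, which is the point of choosing SAX over a DOM tree here.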
How about Smalltalk time? Let's ask TimeProfiler.
Btw, saw this comment on HN: "If it fits on an iPod, it's not big data." :-)
Invoking the parser is simple. The code below lets you explore the Smalltalk object representing the parsed HTML file:
Since the HTML is generated by code I wrote in the first place, I know that my content is in the DIV of class 'clearfix', and every HTML file has just one such DIV.
The above code yields the following:
'<p>I’ve been dabbling with Squeak, Pharo, and other Smalltalk implementations for a while. I’ve now taken the plunge to blog about Smalltalk.</p> <p>Running one of the leading blog engines (installing PHP, MySQL, etc., keeping up with security patches for the entire software stack, yada yada) is simply too much hassle. For now this blog is made up of static pages, managed by code written in Pharo.</p> <p>Pierce</p>'
Having extracted the content fragment, it is now a matter of writing the fragment out through an iPhone-specific HTML template.
In terms of Smalltalk programming, the parsing and searching code is actually "discovered" iteratively by browsing code in the code browser, and running code on live objects using the object explorer, as shown by the screenshots below. These provide a very good illustration of the power of Smalltalk's integrated environment.
Figure 1: Parse an HTML file and explore the resultant Smalltalk object. This results in the object explorer in figure 2.
Figure 2: In this object explorer, "explore" the result of running the code shown in its workspace pane. Here "self" refers to the HtmlDocument instance. This brings up another object explorer shown in figure 3.
Figure 3: The object explorer on the right shows the array returned by the "self nodesSelect: ..." code executed in the object explorer on the left. This time we "print" the output of "self first innerContents" in its workspace pane; here "self" refers to the array.