Periodically, the Stack Exchange people publish a
dump of the content
of all public Stack Exchange sites. I played with it back in 2009
when this started, but have lost what little code I wrote back then.
I just downloaded the Sep 2011 dump. The StackOverflow portion alone is sizable: assuming each row is a line by itself, there were more than six million posts as of Sep 2011.
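From the shell, something along these lines gives that count (the exact egrep invocation here is my reconstruction, not necessarily the one used):

    egrep -c '<row Id' posts.xml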
According to readme.txt in the dump package, the file posts.xml has, among other fields:
ParentID (only present if PostTypeId is 2)
AcceptedAnswerId (only present if PostTypeId is 1)
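Concretely, each post is one row element in posts.xml; a made-up example, with most attributes omitted:

    <row Id="42" PostTypeId="1" AcceptedAnswerId="314" CreationDate="2011-09-01T00:00:00" Score="7" />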
I'm not going to build a DOM tree of 6+ million posts in RAM yet, so
I'll use a SAX handler to parse the thing. First, install XMLSupport:
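One way to load it is via Gofer from SqueakSource; the repository and configuration names below are from memory and may have moved since:

    Gofer new
        squeaksource: 'XMLSupport';
        package: 'ConfigurationOfXMLSupport';
        load.
    (Smalltalk at: #ConfigurationOfXMLSupport) project latestVersion load.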
As per SAXHandler's class comment, subclass it and override handlers under
the "content" and "lexical" categories as needed:
For a schema as simple as the above, the method of interest is this:
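A sketch of such a method, which just echoes whatever arrives, to confirm the handler fires; each post shows up as a row element:

    startElement: aQualifiedName attributes: aDictionary
        "Called once per element; dump the name and attributes to the Transcript."
        Transcript show: aQualifiedName; show: ' '; show: aDictionary printString; cr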
Using a 1-row test set, a quick do-it confirms that the handler fires and echoes the row's name and attributes to the Transcript.
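Such a do-it could look like this, with a made-up one-row document inline (parse: is the class-side entry point in the versions of the package I know; an older image may differ):

    SOPostsHandler parse:
        '<posts><row Id="1" PostTypeId="1" AcceptedAnswerId="2" /></posts>'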
From here on, it is straightforward to flesh out startElement:attributes: to
extract the stuff that is interesting to me.
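For instance, a version that picks out just the fields called out above; the attribute spellings follow readme.txt, so double-check them against the actual file:

    startElement: aQualifiedName attributes: aDictionary
        "Pick out the attributes of interest from each post row."
        | picked |
        aQualifiedName = 'row' ifFalse: [^ self].
        picked := #('Id' 'PostTypeId' 'ParentID' 'AcceptedAnswerId')
            collect: [:key | key -> (aDictionary at: key ifAbsent: [nil])].
        Transcript show: picked printString; cr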
To count the actual number of records, just keep a running count as each post is parsed, and print that number in the method endDocument, as sketched below.
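A sketch of the three methods involved, using the postCount instance variable from the class definition above (startDocument and endDocument being the usual SAX hooks on SAXHandler):

    startDocument
        postCount := 0

    startElement: aQualifiedName attributes: aDictionary
        aQualifiedName = 'row' ifTrue: [postCount := postCount + 1]

    endDocument
        Transcript show: postCount printString; show: ' posts'; cr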
The run took a long time (by the wall clock) and counted 6,479,788 posts, the same number as produced by egrep'ping for row Id.
How about Smalltalk time? Let's ask TimeProfiler.
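Something along these lines wraps the whole-file run in a block for the profiler; whether parse: takes a FileStream directly depends on the XMLSupport version at hand:

    TimeProfiler spyOn: [
        SOPostsHandler parse: (FileStream readOnlyFileNamed: 'posts.xml') ]

If TimeProfiler isn't around, MessageTally spyOn: gives the same tally without the UI.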
Btw, saw this comment on HN: "If it fits on an iPod, it's not big data." :-)