Periodically, the Stack Exchange people publish a dump of the content of all public Stack Exchange sites. I played with it back in 2009 when the dumps first started, but have lost what little code I wrote back then.
I just downloaded the Sep 2011 dump. For StackOverflow alone, here are the file sizes:
Assuming each row is a line by itself, there were more than six million posts as of Sep 2011:
According to readme.txt in the dump package, the file posts.xml has the following schema:
- PostTypeId
  - 1: Question
  - 2: Answer
- ParentID (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- LastEditorDisplayName="Jeff Atwood"
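Concretely, each post in posts.xml is a single self-closing row element with everything in attributes. Here's what one looks like, with made-up values for illustration:

```xml
<row Id="123" PostTypeId="1" AcceptedAnswerId="456"
     CreationDate="2011-09-01T00:00:00.000" Score="7"
     Title="How do I parse a huge XML file?" />
```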
I'm not going to build a DOM tree of 6+ million posts in RAM yet, so I'll use a SAX handler to parse the thing. First, install XMLSupport:
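In a 2011-vintage Pharo image, loading the Metacello configuration from SqueakSource went something like this (the repository name and the stable-version incantation are from memory, so treat it as a sketch):

```smalltalk
"Fetch the Metacello configuration for XMLSupport from SqueakSource."
Gofer new
	squeaksource: 'XMLSupport';
	package: 'ConfigurationOfXMLSupport';
	load.
"Then load the stable version; Smalltalk at: avoids an undeclared-class prompt in a single do-it."
((Smalltalk at: #ConfigurationOfXMLSupport) project version: #stable) load.
```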
As per SAXHandler's class comment, subclass it and override handlers under the "content" and "lexical" categories as needed:
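Something like this, then (the class name, instance variable, and category are my own choices):

```smalltalk
SAXHandler subclass: #SOPostHandler
	instanceVariableNames: 'postCount'
	classVariableNames: ''
	category: 'SODump'
```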
For a schema as simple as the above, the method of interest is this:
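A minimal version might look like the sketch below; it just picks out Id and PostTypeId, and assumes the attributes argument behaves like a Dictionary:

```smalltalk
startElement: aQualifiedName attributes: anAttributeDictionary
	"Every post in posts.xml is one <row .../> element; ignore everything else."
	aQualifiedName = 'row' ifFalse: [^self].
	Transcript
		show: (anAttributeDictionary at: 'Id' ifAbsent: ['?']);
		show: ' PostTypeId=';
		show: (anAttributeDictionary at: 'PostTypeId' ifAbsent: ['?']);
		cr
```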
Using a 1-row test set, the following do-it
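(sketched here; the test file name is made up, and parseFileNamed: is the class-side convenience I believe XMLSupport provides)

```smalltalk
"Run the handler over a hypothetical 1-row extract of posts.xml."
SOPostHandler parseFileNamed: 'posts-1row.xml'
```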
produces this output:
From here on, it is straightforward to flesh out startElement:attributes: to extract the stuff that is interesting to me.
To count the actual number of records, just keep a running count as each post is parsed, and print that number in the method endDocument. The run took a long time (by the wall clock) and counted 6,479,788 posts, the same number as produced by egrep'ping rowId.
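The counting itself is a couple of trivial methods on the handler; roughly this (again a sketch, with the Transcript as the output channel):

```smalltalk
startDocument
	postCount := 0

startElement: aQualifiedName attributes: anAttributeDictionary
	aQualifiedName = 'row' ifTrue: [postCount := postCount + 1]

endDocument
	Transcript show: postCount printString, ' posts'; cr
```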
How about Smalltalk time? Let's ask TimeProfiler.
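Something along these lines, assuming TimeProfiler's spyOn: block interface (and the same made-up handler and file names as above):

```smalltalk
"Open the profiler on the whole parse to see where the time goes inside the image."
TimeProfiler spyOn: [SOPostHandler parseFileNamed: 'posts.xml']
```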
Btw, saw this comment on HN: "If it fits on an iPod, it's not big data." :-)