In the previous entry, I wrote about parsing the StackOverflow XML dump files.
This entry is about setting up SQLite full text search for the StackOverflow posts data. In Smalltalk, set up the database schema:
The key SAX parsing callback method becomes, in essence, as follows:
On my computer, running through the 6+ million posts took about an hour. The resultant SQLite database file is about 6.5GB in size. Here are the comparative file sizes:
Next, let's set up full text search on this database:
I didn't time how long the INSERTs took. Setting up full text search boosted the database's file size:
Now search the database using the sqlite3 shell tool:
% sqlite3 so.db sqlite> select count(*) from stags where tags match 'smalltalk'; 332 sqlite> select max(id) from stags where tags match 'smalltalk'; 7260027 sqlite> select body, tags from posts where id = 7260027; <p>what's the best server side programming language for web development if we keep very good performance and a lot of database interaction of the script with server in mind. i am already working on a project with php, i'm fearing that my application might have a lot of traffic, i just want to use only the server side language which is best in terms of performance, and please also guide me any major step i should take to prevent my application from crashing when it will face a huge traffic, what major step i might need to take to stop myself crying after launching my applicaition, i heard that twitter use ruby, how's ruby's performance? my concern is only and only performance and database interaction, thanks in advance.</p> |<php><python><ruby-on-rails><ruby><smalltalk> sqlite>
Periodically, the Stack Exchange people publish a dump of the content of all public Stack Exchange sites. I played with it back in 2009 when this started, but have lost what little code I wrote back then.
I just downloaded the Sep 2011 dump. For StackOverflow alone, here are the file sizes:
Assuming each row is a line by itself, there were more than six million posts as of Sep 2011:
According to readme.txt in the dump package, the file posts.xml has the following schema:
- 1: Question
- 2: Answer
- ParentID (only present if PostTypeId is 2)
- AcceptedAnswerId (only present if PostTypeId is 1)
- LastEditorDisplayName="Jeff Atwood"
I'm not going to build a DOM tree of 6+ millions posts in RAM yet, so I'll use a SAX handler to parse the thing. First, install XMLSupport:
As per SAXHandler's class comment, subclass it and override handlers under the "content" and "lexical" categories as needed:
For a schema as simple as the above, the method of interest is this:
Using a 1-row test set, the following do-it
produces this output:
From here on, it is straightforward to fleshen startElement:attributes: to extract the stuff that is interesting to me.
To count the actual number of records, just keep a running count as each post is parsed, and print that number in the method endDocument. The run took a long time (by the wall clock) and counted 6,479,788 posts, the same number as produced by egrep'ping rowId.
How about Smalltalk time? Let's ask TimeProfiler.
Btw, saw this comment on HN: "If it fits on an iPod, it's not big data." :-)