
Parsing HTML

9 April 2011

Continuing from my previous post, I am using the excellent HTML & CSS Validating Parser.

Invoking the parser is simple. The code below parses an HTML file and opens an object explorer on the resulting Smalltalk object:

| doc div |
FileStream readOnlyFileNamed: '/tmp/about.html' do: [ :file |
    doc := HtmlDocument new parseContents: file ].
doc explore.

Since the HTML is generated by code I wrote in the first place, I know that my content is in the DIV of class 'clearfix', and every HTML file has just one such DIV.

div := doc nodesSelect: [ :ea | 
    (ea classes select: [ :c | c = 'clearfix' ]) size > 0 ].
div first innerContents

The above code yields the following:

'<p>I’ve been dabbling with Squeak, Pharo, and other Smalltalk implementations
for a while. I’ve now taken the plunge to blog about Smalltalk.</p>

<p>Running one of the leading blog engines (installing PHP, MySQL, etc.,
keeping up with security patches for the entire software stack, yada yada)
is simply too much hassle. For now this blog is made up of static pages,
managed by code written in Pharo.</p>

<p>Pierce</p>'

Having extracted the content fragment, it is now a matter of writing the fragment out through an iPhone-specific HTML template.
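
As a rough sketch of that last step, the snippet below substitutes the fragment into a template string and writes the result to a file. It is not the actual template code: the '{{content}}' placeholder, the template markup and the output path are all made up for illustration, and it assumes FileStream class>>forceNewFileNamed:do: is available in the image, as it is in the Pharo images I have used.

```smalltalk
| fragment template page |
"Stand-in for the string returned by 'div first innerContents'."
fragment := '<p>Pierce</p>'.

"Hypothetical template; {{content}} is an assumed placeholder token."
template := '<html><head><meta name="viewport" content="width=device-width"/></head><body>{{content}}</body></html>'.

"Substitute the fragment for the placeholder."
page := template copyReplaceAll: '{{content}}' with: fragment.

"Write the page out; the path is hypothetical."
FileStream forceNewFileNamed: '/tmp/about-iphone.html' do: [ :out |
    out nextPutAll: page ].
```

Plain string replacement is enough here because the fragment is already well-formed HTML produced by my own code, so no escaping is needed.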

In terms of Smalltalk programming, the parsing and searching code was "discovered" iteratively, by browsing code in the code browser and running snippets on live objects in the object explorer, as the screenshots below show. They are a good illustration of the power of Smalltalk's integrated environment.

Figure 1: Parse an HTML file and explore the resultant Smalltalk object. This results in the object explorer in figure 2.

Figure 2: In this object explorer, "explore" the result of running the code shown in its workspace pane. Here "self" refers to the HtmlDocument instance. This brings up another object explorer shown in figure 3.

Figure 3: The object explorer on the right shows the array returned by the "self nodesSelect: ..." code executed in the object explorer on the left. This time we "print" the output of "self first innerContents" in its workspace pane; here "self" refers to the array.
