Kwaku

Reputation: 276

How do I collect the h1 headings of a number of web pages?

I would like to go through a couple of web pages

 theURLs := #('url1' 'url2' 'url3')

and get the content of the first h1 heading

 theURLs collect: [ :anURL | | page |
                    page := HTTPClient httpGetDocument: anURL.
                    page firstH1heading ].

Question

What do I need to put in place of #firstH1heading ?

Answers for Squeak / Pharo / Cuis are welcome.

Note

In Squeak

HTTPClient httpGetDocument: 'http://pharo.org/'

gives back a

MIMEDocument

So I would expect to do something like

 theURLs collect: [ :anURL | | page |
                    page := HTMLDocument on:
                            (HTTPClient httpGetDocument: anURL).
                    page firstH1heading ].

But in Squeak 4.6 there is no HTMLDocument class, though it seems there used to be one (http://wiki.squeak.org/squeak/2249). The wiki says that I should load a package Network-HTML. The SqueakMap catalog of Squeak 4.6 has a package 'XMLParser-HTML'. Can this be used instead?
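If 'XMLParser-HTML' mirrors the API of Pharo's XMLParserHTML package, a sketch might look like the following. The class name XMLHTMLParser and the messages #findElementNamed: and #contentString are taken from that Pharo package and are assumptions here; I have not verified them in Squeak 4.6.

    | theURLs |
    theURLs := #('url1' 'url2' 'url3').
    theURLs collect: [ :anURL | | doc |
        "MIMEDocument answers its body string to #content"
        doc := XMLHTMLParser parse:
                   (HTTPClient httpGetDocument: anURL) content.
        "contentString is assumed to answer the element's text"
        (doc findElementNamed: 'h1') contentString ]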

Upvotes: 2

Views: 132

Answers (2)

Stephan Eggermont

Reputation: 15907

I've updated the configuration. You might need to refresh the catalog.

Name: ConfigurationOfSoup-StephanEggermont.75
Author: StephanEggermont
Time: 14 December 2015, 1:39:52.307715 pm
UUID: 6c11fb83-5299-4852-9563-73ecc34992a0
Ancestors: ConfigurationOfSoup-FrancoisStephany.74

Adopted bug fix to stable 1.7.1, added Pharo 5 versions

Upvotes: 2

MartinW

Reputation: 5041

In Pharo, you can use the Soup package. Install it via the Configuration Browser.

You retrieve a document from a URL with Zinc, and find the first <h1> tag with Soup like this:

| contents soup body |
contents := ZnClient new get: 'http://zn.stfx.eu/zn/small.html'.
soup := Soup fromString: contents.
body := soup body.
body findTag: 'h1'
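Plugging this into the original collect: loop could look like the sketch below. Note that #findTag: answers the tag element itself; to get the heading's string contents I send #text, which is an assumption about the Soup tag protocol that you should verify in your image.

    | theURLs |
    theURLs := #('url1' 'url2' 'url3').
    theURLs collect: [ :anURL | | soup |
        soup := Soup fromString: (ZnClient new get: anURL).
        "#text is assumed to answer the tag's string contents"
        (soup findTag: 'h1') text ]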

Upvotes: 3
