Reputation: 276
I would like to go through a couple of web pages
theURLs := #('url1' 'url2' 'url3')
and get the content of the first h1 heading
theURLs collect: [ :anURL | page := HTTPClient httpGetDocument: anURL.
page firstH1heading].
What do I need to put in place of #firstH1heading?
Answers for Squeak / Pharo / Cuis are welcome.
In Squeak
HTTPClient httpGetDocument: 'http://pharo.org/'
gives back a MIMEDocument.
So I would expect to do something like
theURLs collect: [ :anURL | page := HTMLDocument on:
(HTTPClient httpGetDocument: anURL).
page firstH1heading].
But Squeak 4.6 has no HTMLDocument class, though it seems there used to be one (http://wiki.squeak.org/squeak/2249). The wiki says that I should load the package Network-HTML. The SqueakMap catalog of Squeak 4.6 has a package 'XMLParser-HTML'. Can this be used instead?
Upvotes: 2
Views: 132
Reputation: 15907
I've updated the configuration. You might need to refresh the catalog.
Name: ConfigurationOfSoup-StephanEggermont.75
Author: StephanEggermont
Time: 14 December 2015, 1:39:52.307715 pm
UUID: 6c11fb83-5299-4852-9563-73ecc34992a0
Ancestors: ConfigurationOfSoup-FrancoisStephany.74
Adopted bug fix to stable 1.7.1, added Pharo 5 versions
Upvotes: 2
Reputation: 5041
In Pharo, you can use the Soup package. Install it via the Configuration Browser.
You can retrieve a document from a URL with Zinc and find the first <h1> tag with Soup like this:
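| contents soup body |
"Fetch the page source as a String with Zinc."
contents := ZnClient new get: 'http://zn.stfx.eu/zn/small.html'.
"Parse the HTML with Soup."
soup := Soup fromString: contents.
body := soup body.
"Answer the first <h1> tag found inside the body."
body findTag: 'h1'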
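To tie this back to the collect: loop in the question, the same calls could be wrapped roughly as follows. This is only a sketch: it assumes that findTag: answers nil when nothing matches and that the resulting tag understands #text, so check both against the Soup version you have loaded. The URL list is a placeholder.

```smalltalk
| theURLs headings |
"Placeholder list; substitute the pages you actually want to visit."
theURLs := #('http://zn.stfx.eu/zn/small.html').
headings := theURLs collect: [ :aURL |
	| soup h1 |
	soup := Soup fromString: (ZnClient new get: aURL).
	h1 := soup body findTag: 'h1'.
	"Assumption: findTag: answers nil if there is no <h1>, and #text answers the tag's text content."
	h1 ifNil: [ nil ] ifNotNil: [ :tag | tag text ] ].
headings
```

Evaluating this in a Playground should answer a collection with one entry per URL, holding nil for any page without an h1 heading.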
Upvotes: 3