pvd

Reputation: 1343

How to extract element from html in Racket?

I want to extract the URLs from a reddit search page. My code is:

#lang racket

(require net/url)
(require html)

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))
(define in (get-pure-port reddit #:redirections 5))

(define response-html (read-html-as-xml in))
(define content-0 (list-ref response-html 0))

(close-input-port in)

The value of content-0 above is:

(element
 (location 0 0 15)
 (location 0 0 82)
...

I'm wondering how to extract specific content from it.

Upvotes: 3

Views: 885

Answers (1)

Greg Hendershott

Reputation: 16260

  1. Usually it's more convenient to deal with HTML as x-expressions instead of the html module's structs.

  2. Also you should probably use call/input-url to handle closing the port automatically.

You can combine both of these ideas by defining a read-html-as-xexpr function and using it like this:

#lang racket/base

(require html
         net/url
         xml)

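;; input-port? -> xexpr?
;; read-html-as-xml returns a list of content items, so wrap them in a
;; dummy root element, convert that to an x-expression, and use caddr to
;; take the first child (skipping the tag and the attribute list).
(define (read-html-as-xexpr in)
  (caddr
   (xml->xexpr
    (element #f #f 'root '()
             (read-html-as-xml in)))))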

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))

(call/input-url reddit
                get-pure-port
                read-html-as-xexpr)

That will return a big x-expression like:

'(html
  ((lang "en") (xml:lang "en") (xmlns "http://www.w3.org/1999/xhtml"))
  (head
   ()
   (title () "programming: search results")
   (meta
    ((content " reddit, reddit.com, vote, comment, submit ")
     (name "keywords")))
   (meta
    ((content "reddit: the front page of the internet") (name "description")))
   (meta ((content "origin") (name "referrer")))
   (meta ((content "text/html; charset=UTF-8") (http-equiv "Content-Type")))
... snip ...

How do you extract specific pieces of that?

  • For simple HTML where I don't expect the overall structure to change, I will often just use match.

  • However, a more correct and robust approach is to use the xml/path module, which finds elements wherever they are nested.
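To make the difference between the two approaches concrete, here is a minimal sketch. The `page` x-expression is a small hand-written stand-in for a parsed page, not real reddit output, and the names `title`, `title*` and `hrefs` are only for illustration:

```racket
#lang racket/base

(require racket/match
         xml/path)

;; Hypothetical data: a tiny x-expression standing in for a parsed page.
(define page
  '(html ()
         (head () (title () "programming: search results"))
         (body ()
               (a ((href "#content")) "jump to content")
               (div ((class "entry"))
                    (a ((href "http://example.com/racket")) "Racket")))))

;; match: the pattern spells out the exact shape of the document,
;; so it stops matching if the structure changes.
(define title
  (match page
    [`(html ,_ (head ,_ (title ,_ ,t)) ,_ ...) t]))

;; xml/path: the path is tried at every depth of the x-expression.
(define title* (se-path* '(title) page))        ; first match only
(define hrefs (se-path*/list '(a #:href) page)) ; all matches, in document order
```

Here `title` and `title*` both end up as the title string, but only the `match` version has to know that `title` sits directly inside `head`.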



UPDATE: I noticed your question started by asking about extracting URLs. Here's the example updated to use se-path*/list to get all the href attributes of all the <a> elements:

#lang racket/base

(require html
         net/url
         xml
         xml/path)

(define (read-html-as-xexprs in) ;; (-> input-port? xexpr?)
  (caddr
   (xml->xexpr
    (element #f #f 'root '()
             (read-html-as-xml in)))))

(define reddit (string->url "http://www.reddit.com/r/programming/search?q=racket&sort=relevance&restrict_sr=on&t=all"))

(define xe (call/input-url reddit
                           get-pure-port
                           read-html-as-xexprs))

(se-path*/list '(a #:href) xe)

Result:

'("#content"
  "http://www.reddit.com/r/announcements/"
  "http://www.reddit.com/r/Art/"
  "http://www.reddit.com/r/AskReddit/"
  "http://www.reddit.com/r/askscience/"
  "http://www.reddit.com/r/aww/"
  "http://www.reddit.com/r/blog/"
  "http://www.reddit.com/r/books/"
  "http://www.reddit.com/r/creepy/"
  "http://www.reddit.com/r/dataisbeautiful/"
  "http://www.reddit.com/r/DIY/"
  "http://www.reddit.com/r/Documentaries/"
  "http://www.reddit.com/r/EarthPorn/"
  "http://www.reddit.com/r/explainlikeimfive/"
  "http://www.reddit.com/r/Fitness/"
  "http://www.reddit.com/r/food/"
  ... snip ...
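
As the result shows, the list also contains in-page fragments such as "#content". If you only want absolute links, one possible follow-up step is to filter the list. This is a rough sketch: `absolute-http-url?` and `links` are hypothetical names, and the sample list is a few entries copied from above plus a made-up duplicate:

```racket
#lang racket/base

(require racket/list
         racket/string)

;; A few hrefs from the result above, plus one duplicate added for illustration.
(define hrefs
  '("#content"
    "http://www.reddit.com/r/announcements/"
    "http://www.reddit.com/r/Art/"
    "http://www.reddit.com/r/Art/"))

;; Hypothetical helper: true for absolute http(s) URLs, false for
;; fragments such as "#content" and for relative paths.
(define (absolute-http-url? s)
  (or (string-prefix? s "http://")
      (string-prefix? s "https://")))

;; Keep the absolute URLs and drop repeats, preserving first-seen order.
(define links
  (remove-duplicates (filter absolute-http-url? hrefs)))
```

In the real program you would apply the same filter to the value returned by se-path*/list instead of a literal list.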

Upvotes: 7
