Pedro Delfino

Reputation: 2681

How to use the Common Lisp libraries dex, plump, and clss to extract the title of a web page?

I am using Emacs, Slime, and SBCL to develop in Common Lisp on a desktop PC running NixOS.

I am using the libraries dex, plump, and clss to extract the title of a web page. I ran:

CL-USER> (clss:select "title" (plump:parse  (dex:get "http://www.pdelfino.com.br")))
#(#<PLUMP-DOM:ELEMENT title {1009C488E3}>)

I was expecting: "Pedro Delfino".

Instead, I got the object:

#(#<PLUMP-DOM:ELEMENT title {1009C488E3}>)

Describing the object does not help me find the value I want:

CL-USER> (clss:select "title" (plump:parse  (dex:get "http://www.pdelfino.com.br")))
#(#<PLUMP-DOM:ELEMENT title {100A9888E3}>)
CL-USER> (describe *)
#(#<PLUMP-DOM:ELEMENT title {100A9888E3}>)
  [vector]

Element-type: T
Fill-pointer: 1
Size: 10
Adjustable: yes
Displaced: no
Storage vector: #<(SIMPLE-VECTOR 10) {100A9B65BF}>
; No value
CL-USER> 

Where is the value that I need?

Thanks

Upvotes: 1

Views: 250

Answers (2)

Ehvince

Reputation: 18375

You can ask plump to return the text inside an HTML node with plump:text. It accepts a single node, not the vector returned by clss:select, so you have to use aref to get the first element.

(plump:text
 (aref (clss:select "title"
                    (plump:parse (dex:get "http://www.pdelfino.com.br")))
       0))

plump:serialize outputs the node's HTML instead of its text, which is useful for inspecting the results.
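Here is a minimal sketch of the same idea that needs no network access, assuming Quicklisp is available to load plump and clss. The inline HTML string and the helper name page-title are made up for illustration. Passing NIL as the stream argument of plump:serialize to get a string back is my understanding of its signature, so check the Plump documentation if it behaves differently for you.

```lisp
;; Assumption: Quicklisp is installed. Load the libraries before the
;; forms below are read, so the package prefixes resolve.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload '(:plump :clss) :silent t))

;; Hypothetical inline document, standing in for the result of dex:get.
(defparameter *html*
  "<html><head><title>Pedro Delfino</title></head><body><p>Hi</p></body></html>")

(defun page-title (html)
  "Return the text of the first <title> element in HTML, or NIL if there is none."
  (let ((matches (clss:select "title" (plump:parse html))))
    ;; clss:select returns a vector, so guard against an empty result
    ;; before calling aref.
    (when (plusp (length matches))
      (plump:text (aref matches 0)))))

;; Text content of the title element.
(page-title *html*)

;; HTML of the same node as a string (NIL stream assumed to mean
;; "serialize to a string").
(plump:serialize (aref (clss:select "title" (plump:parse *html*)) 0) nil)
```

The length check matters on pages without a title element, where (aref matches 0) would otherwise signal an error.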

You can also combine CLSS and Plump by using lQuery (https://shinmera.github.io/lquery/). First parse the HTML with initialize, then query it with $, as in (lquery:$ <document> "selector"). Adding (text) or (serialize) as the last argument returns the text or the HTML of each match.

(defparameter *PDELFINO-PARSED* (lquery:$ (initialize (dex:get "http://www.pdelfino.com.br"))))

(lquery:$ *PDELFINO-PARSED* "title")
#(#<PLUMP-DOM:ELEMENT title {1008645923}>)

CIEL-USER> (lquery:$ *PDELFINO-PARSED* "title" (text))
#("Pedro Delfino")

CIEL-USER> (aref * 0)
"Pedro Delfino"

CIEL-USER> (lquery:$ *PDELFINO-PARSED* "title" (serialize))
#("<title>Pedro Delfino</title>")

Upvotes: 2

Xach

Reputation: 11854

The text of the title is in its child text node. In this example, the following returns that text:

(plump:text (plump:first-child (aref (clss:select "title" (plump:parse (dex:get "http://www.pdelfino.com.br"))) 0)))
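To see the parent/child relationship without fetching the page, here is a small sketch that parses a made-up inline string instead. It assumes Quicklisp is available, and the variable names *title-node* and *child* are only for illustration.

```lisp
;; Assumption: Quicklisp is installed.
(eval-when (:compile-toplevel :load-toplevel :execute)
  (ql:quickload '(:plump :clss) :silent t))

;; The <title> element itself, taken from the vector clss:select returns.
(defparameter *title-node*
  (aref (clss:select "title" (plump:parse "<title>Pedro Delfino</title>")) 0))

;; Its first child, which holds the actual characters.
(defparameter *child* (plump:first-child *title-node*))

;; Check that the child is a text node rather than an element.
(typep *child* 'plump-dom:text-node)

;; Text read from the child node directly.
(plump:text *child*)

;; Text read from the element, which collects the text of its children.
(plump:text *title-node*)
```

For a title that contains only text, both calls to plump:text give the same string, so going through the child is mostly useful when you want to inspect or modify the text node itself.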

Upvotes: 2
