turingcomplete

Reputation: 2188

Scraping HTML in lisp

My question is related to another question found here: "Scraping an HTML table in Common Lisp?"

I am trying to extract data from a webpage in Common Lisp. I am currently using drakma to send the HTTP request, and I'm trying to use chtml to extract the data I am looking for. The webpage I'm trying to scrape is http://erg.delph-in.net/logon. Here is my code:

(defun send-request (sentence)
  "Send SENTENCE in an HTTP POST request to LOGON for parsing and return
the webpage containing the MRS output."
  (drakma:http-request "http://erg.delph-in.net/logon"
                       :method :post
                       :parameters `(("input" . ,sentence)
                                     ("task" . "Analyze")
                                     ("roots" . "sentences")
                                     ("output" . "mrs")
                                     ("exhaustivep" . "best")
                                     ("nresults" . "1"))))

And here's the function I am having trouble with:

(defun get-mrs (sentence)
  (let* ((str (send-request sentence))
         (document (chtml:parse str (cxml-stp:make-builder))))
    (stp:filter-recursively (stp:of-name "mrsFeatureTop") document)))

Basically, all the data I need to extract is in an HTML table, but it's too big to paste here. In my get-mrs function, I was just trying to get the tag named mrsFeatureTop. I am not sure this is correct, though, since I am getting an error: not an NCName 'onclick. Any help with scraping the table would be greatly appreciated. Thank you.

Upvotes: 1

Views: 2007

Answers (1)

Vr Rm

Reputation: 41

Ancient question, I know, but one that defeated me for a long time. It's true that a lot of webpages are rubbish, but nearly the entire Web 2.0 is built upon screen scraping, integrating heterogeneous websites with hack upon hack -- should be an ideal application for Lisp!

The key (in addition to drakma) is lquery, which lets you access a page's contents using a lispy transliteration of CSS selectors (what jQuery uses).
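As a minimal sketch of what those selectors look like, here is lquery run against a small inline HTML string instead of a live page. The markup, the .strip class, and the image-sources helper are all made up for illustration. plump:parse comes from Plump, the parser lquery is built on, and handing the parsed document to $ as its first argument means the query does not depend on a page initialized earlier.

```lisp
(ql:quickload :lquery)

;; Hypothetical sample markup, not taken from any real page.
(defparameter *sample-page*
  "<html><body>
     <div class=\"strip\"><a href=\"/a\"><img src=\"a.gif\"></a></div>
     <div class=\"strip\"><a href=\"/b\"><img src=\"b.gif\"></a></div>
   </body></html>")

(defun image-sources (html)
  "Return the src attribute of every img below an element with class strip."
  (let ((doc (plump:parse html)))
    ;; \".strip img\" is an ordinary CSS descendant selector;
    ;; (attr :src) reads the attribute from each matched node.
    (coerce (lquery:$ doc ".strip img" (attr :src)) 'list)))

(image-sources *sample-page*)
```

The same function should work on a string returned by drakma:http-request, as long as the selector matches the real page.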

Let's get the links from the media strip on Google's news page! If you open https://news.google.com in a browser and view source, you'll be overwhelmed by the complexity of the page. But if you view the page in the browser's developer panel (Firefox: F12, Inspector), you'll see the page has some logic to it. Use the search box to find .media-strip-table. That element contains the images we want. Now open your favourite REPL. (Well, let's be honest here, Emacs: M-x slime)

(ql:quickload '(:drakma :lquery))

;;; Get the links from the media strip on Google's news page.
(defparameter http-response (drakma:http-request "https://news.google.com/"))

;;; lquery parses the page and gets it ready to be queried.
(lquery:$ (initialize http-response))

Now let's explore the results:

;;; Package-qualified '$' operator, barbaric!
;;; Use (use-package :lquery) to omit the package prefix.
(lquery:$ ".media-strip-table" (html))

Wow! That's just a tiny section of the page? OK, how about the first element?

(elt (lquery:$ ".media-strip-table" (html)) 0)

OK, that's a little more manageable. Let's see if there's an image tag in there somewhere (Emacs: C-s img). Yay! There it is.

(lquery:$ ".media-strip-table img" (html))

Hmmm... It's finding something, but only returning empty text... Oh yeah, image tags are supposed to be empty!

(lquery:$ ".media-strip-table img" (attr :src))

Holy crap! GIFs aren't just used for unfunny, grainy animations?
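Back to the table in the question: the same approach should carry over, though I haven't run it against the LOGON page. The sketch below assumes mrsFeatureTop is a CSS class on the table (the question doesn't show the markup, so that is a guess), and it uses a tiny made-up table in place of the string returned by send-request. The mrs-table-rows helper is hypothetical too.

```lisp
(ql:quickload :lquery)

;; Hypothetical stand-in for the page returned by SEND-REQUEST.
;; The real markup is not shown in the question.
(defparameter *sample-mrs-page*
  "<table class=\"mrsFeatureTop\"><tr><td>TOP</td><td>h1</td></tr><tr><td>INDEX</td><td>e2</td></tr></table>")

(defun mrs-table-rows (html)
  "Return the table cells as a list of rows, each row a list of strings.
Assumes the table carries the class mrsFeatureTop."
  (let ((doc (plump:parse html)))
    (map 'list
         (lambda (row)
           ;; Query each row node on its own to collect its cell texts.
           (coerce (lquery:$ row "td" (text)) 'list))
         (lquery:$ doc ".mrsFeatureTop tr"))))

(mrs-table-rows *sample-mrs-page*)
```

If mrsFeatureTop turns out to be something other than a class on the table, only the selector string should need changing; (mrs-table-rows (send-request "some sentence")) would then be the call against the live page.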

Upvotes: 3
