Reputation: 2188
My question is related to another question found here: Scraping an HTML table in Common Lisp?
I am trying to extract data from a webpage in Common Lisp. I am currently using Drakma to send the HTTP request, and I'm trying to use closure-html (chtml) to extract the data I am looking for. The webpage I'm trying to scrape is http://erg.delph-in.net/logon. Here is my code:
(defun send-request (sentence)
  "Sends SENTENCE in an HTTP request to LOGON for parsing, and receives
back the webpage containing the MRS output."
  (drakma:http-request "http://erg.delph-in.net/logon"
                       :method :post
                       :parameters `(("input" . ,sentence)
                                     ("task" . "Analyze")
                                     ("roots" . "sentences")
                                     ("output" . "mrs")
                                     ("exhaustivep" . "best")
                                     ("nresults" . "1"))))
And here's the function I am having trouble with
(defun get-mrs (sentence)
  (let* ((str (send-request sentence))
         (document (chtml:parse str (cxml-stp:make-builder))))
    (stp:filter-recursively (stp:of-name "mrsFeatureTop") document)))
Basically all the data I need to extract is in an HTML table; it's too big to paste here, though. In my get-mrs function, I was just trying to get the tag named mrsFeatureTop, but I am not sure this is correct, since I am getting an error: not an NCName 'onclick. Any help with scraping the table would be greatly appreciated. Thank you.
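In case it clarifies what I'm after, this is the kind of filtering I think I need: matching on an attribute value rather than an element name (untested, and I'm only assuming mrsFeatureTop is a class attribute; it could just as well be an id or something else):
(defun get-mrs (sentence)
  (let* ((str (send-request sentence))
         (document (chtml:parse str (cxml-stp:make-builder))))
    ;; Collect elements whose "class" attribute is "mrsFeatureTop",
    ;; rather than elements *named* mrsFeatureTop.
    (stp:filter-recursively
     (lambda (node)
       (and (typep node 'stp:element)
            (equal (stp:attribute-value node "class") "mrsFeatureTop")))
     document)))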
Upvotes: 1
Views: 2007
Reputation: 41
Ancient question, I know. But one that defeated me for a long time. It's true that a lot of webpages are rubbish, but nearly the entire Web 2.0 is built upon screen scraping, integrating heterogeneous websites with hack upon hack -- it should be an ideal application for Lisp!
The key (in addition to Drakma) is lquery, which allows you to access the page's contents using a lispy transliteration of CSS selectors (the same selectors jQuery uses).
Let's get the links from the media strip on Google's news page! If you open https://news.google.com in a browser and view the source, you'll be overwhelmed by the complexity of the page. But if you view the page in the browser's developer tools (Firefox: F12, Inspector), you'll see the page has some logic to it. Use the search box to find .media-strip-table; that element contains the images we want. Now open your favourite REPL. (Well, let's be honest here, Emacs: M-x slime.)
(ql:quickload '(:drakma :lquery))
;;; Get the links from the media strip on Google's news page.
(defparameter response (drakma:http-request "https://news.google.com/"))
;;; lquery parses the page and gets it ready to be queried.
(lquery:$ (initialize response))
Now let's explore the results.
;;; Package-qualified '$' operator, barbaric!
;;; Use (use-package :lquery) to omit the package prefix.
(lquery:$ ".media-strip-table" (html))
Wow! That's just a tiny section of the page? Ok, how about the first element?
(elt (lquery:$ ".media-strip-table" (html)) 0)
OK, that's a little more manageable. Let's see if there's an image tag in there somewhere (Emacs: C-s img).
Yay! There it is.
(lquery:$ ".media-strip-table img" (html))
Hmmm... It's finding something, but only returning empty text... Oh yeah, image tags are supposed to be empty!
(lquery:$ ".media-strip-table img" (attr :src))
Holy crap! GIFs aren't just used for unfunny, grainy animations?
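For what it's worth, the result of (attr :src) is a vector of strings, so it can be coerced to an ordinary list and a URL handed straight back to drakma. A rough sketch, assuming the src values are absolute URLs (relative or protocol-relative ones would need to be resolved against https://news.google.com first):
;;; Collect the image URLs into a plain list.
(defparameter srcs
  (coerce (lquery:$ ".media-strip-table img" (attr :src)) 'list))

;;; Fetch the first image; for binary content types drakma returns the
;;; body as a vector of octets.
(drakma:http-request (first srcs))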
Upvotes: 3