Reputation: 8690
I'm trying to parse an HTML file and get all href's inside it.
So far, the code I'm using is:
(map
#(println (str "Match: " %))
(re-find #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response))
str_response being the string with the HTML code inside it. According to my basic understanding of Clojure, that code should print a list of matches, but so far, no luck.
It doens't crash, but it doens't match anything either.
I've tried using re-seq
instead of re-find
, but with no luck. Any help?
Thanks!
Upvotes: 3
Views: 683
Reputation: 363
I don't think there is anything wrong with your code. Perhapsstr_response
is the suspect. The following works with http://google.com with your regex:
(let [str_response (slurp "http://google.com")]
(map #(println (str "Match: " %))
(re-seq #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response))
Note ref-find
also works though it only returns one match.
Upvotes: 2
Reputation: 17774
This really looks like an HTML scraping problem in which case, I would advise using enlive.
Something like this should work
(ns test.foo
(:require [net.cgrand.enlive-html :as html]))
(let [url (html/html-resource
(java.net.URL. "http://www.nytimes.com"))]
(map #(-> % :attrs :href) (html/select url [:a])))
Upvotes: 3
Reputation: 91544
it is generally though that you cannot parse html with a regex (entertaining answer), though just finding all occurances of one tag should be dooable.
once you figure out the proper regex re-seq
is the function you want to use:
user> (re-find #"aa" "aalkjkljaa")
"aa"
user> (re-seq #"aa" "aalkjkljaa")
("aa" "aa")
this is not crashing for you because re-find is returning nil which map is interpreting as an empty list and doing nothing
Upvotes: 4