Deleteman

Reputation: 8690

Getting all matches for a regexp in Clojure

I'm trying to parse an HTML file and get all the hrefs inside it.

So far, the code I'm using is:

(map 
   #(println (str "Match: " %)) 
   (re-find #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response))

str_response is the string with the HTML code inside it. According to my basic understanding of Clojure, that code should print a list of matches, but so far, no luck. It doesn't crash, but it doesn't match anything either. I've tried using re-seq instead of re-find, but with no luck. Any help?

Thanks!

Upvotes: 3

Views: 683

Answers (3)

jbear

Reputation: 363

I don't think there is anything wrong with your code. Perhaps str_response is the suspect. The following works against http://google.com with your regex:

(let [str_response (slurp "http://google.com")]
  (map #(println (str "Match: " %))
       (re-seq #"(?sm)href=\"([a-zA-Z.:/]+)\"" str_response)))

Note that re-find also works, though it only returns one match.
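To make the difference concrete, here is a minimal sketch that runs without network access. The sample-html string and the names href-pattern, first-match and all-hrefs are made up for illustration; only the regex comes from the question. Because the pattern has a capture group, each match should come back as a vector of the whole match followed by the group, so second is used to pull out the URL.

```clojure
;; Hypothetical sample input standing in for str_response.
(def sample-html
  "<a href=\"http://example.com/a\">A</a> <a href=\"http://example.com/b\">B</a>")

;; The regex from the question, unchanged.
(def href-pattern #"(?sm)href=\"([a-zA-Z.:/]+)\"")

;; re-find returns only the first match, as [whole-match group-1].
(def first-match (re-find href-pattern sample-html))

;; re-seq returns every match; second picks the captured URL out of each one.
(def all-hrefs (mapv second (re-seq href-pattern sample-html)))

;; doseq runs the println side effect eagerly, once per URL.
(doseq [href all-hrefs]
  (println (str "Match: " href)))
```

doseq is used for the printing because map is lazy: outside the REPL, a map whose result is never consumed may not call println at all.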

Upvotes: 2

Julien Chastang

Reputation: 17774

This really looks like an HTML scraping problem, in which case I would advise using Enlive.

Something like this should work:

(ns test.foo
  (:require [net.cgrand.enlive-html :as html]))

(let [url (html/html-resource
           (java.net.URL. "http://www.nytimes.com"))]
  (map #(-> % :attrs :href) (html/select url [:a])))

Upvotes: 3

Arthur Ulfeldt

Reputation: 91544

It is generally thought that you cannot parse HTML with a regex (entertaining answer), though just finding all occurrences of one tag should be doable.

Once you figure out the proper regex, re-seq is the function you want to use:

user> (re-find #"aa" "aalkjkljaa")
"aa"
user> (re-seq #"aa" "aalkjkljaa")
("aa" "aa")

This is not crashing for you because re-find is returning nil, which map interprets as an empty list, so it does nothing.
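The nil case can be checked directly. This is a small sketch that reuses the string from the REPL session above; the pattern #"zz" and the names no-match and mapped are made up for illustration.

```clojure
;; A pattern that does not occur in the string, so re-find returns nil.
(def no-match (re-find #"zz" "aalkjkljaa"))

;; map treats nil as an empty collection: the function is never called
;; and the result is an empty sequence. doall forces the lazy sequence.
(def mapped (doall (map #(str "Match: " %) no-match)))

(println (nil? no-match))
(println (empty? mapped))
```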

Upvotes: 4
