Reputation:
I have a URL checker that I use in Perl. I was wondering how something like this would be done in Clojure. I have a file with thousands of URLs and I'd like the output file to contain the URL (minus http://, https://) and a simple :1 for valid and :0 for false. Ideally, I could check each site concurrently, considering that this is one of Clojure's strengths.
http://www.google.com
http://www.cnn.com
http://www.msnbc.com
http://www.abadurlisnotgood.com
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0
Upvotes: 2
Views: 1649
Reputation: 18005
Clojure now has a as-url
function in clojure.java.io
:
(as-url "http://google.com") ;;=> #object[java.net.URL 0x5dedf9bd "http://google.com"]
(str (as-url "http://google.com")) ;;=> "http://google.com"
(as-url "notanurl") ;; java.net.MalformedURLException
Based on that we could write a function like so:
(defn check-url
"checks if the url is well formed"
[url]
(str (clojure.string/replace-first url #"(http://|https://)" "")
":"
(try (as-url url) ;; built-in, does not perform an actual request, and does very little validation
1
(catch Exception e 0))))
(defn check-urls-from-file
"from Brian Carper answer"
[filename]
(doseq [line (map check-url (read-lines (as-file filename)))]
(println line)))
Upvotes: 1
Reputation:
Instead of pmap, I used agents with send-off in conjunction with the above solution. I think this is better when there is blocking I/O. I believe pmap has limited concurrency too. Here's what I have so far. I wonder how this will scale with thousands of URLs.
(use '(clojure.contrib duck-streams
java-utils
str-utils))
(import '(java.net URL
URLConnection
HttpURLConnection
UnknownHostException))
(defn check-url [url]
(str (re-sub #"^(?i)http:/+" "" url)
":"
(try
(let [c (cast HttpURLConnection
(.openConnection (URL. url)))]
(if (= 200 (.getResponseCode c))
1
0))
(catch UnknownHostException _
0))))
(def urls (read-lines "urls.txt"))
(def agents (for [url urls] (agent url)))
(doseq [agent agents]
(send-off agent check-url))
(apply await agents)
(def x '())
(doseq [url (filter deref agents)]
(def x (cons @url x)))
(prn x)
(shutdown-agents)
Upvotes: 0
Reputation: 73006
I assume by "valid URL" you mean HTTP response 200. This might work. It requires clojure-contrib
. Change map
to pmap
to attempt to make it parallel, like Arthur Ulfeldt mentioned.
(use '(clojure.contrib duck-streams
java-utils
str-utils))
(import '(java.net URL
URLConnection
HttpURLConnection
UnknownHostException))
(defn check-url [url]
(str (re-sub #"^(?i)http:/+" "" url)
":"
(try
(let [c (cast HttpURLConnection
(.openConnection (URL. url)))]
(if (= 200 (.getResponseCode c))
1
0))
(catch UnknownHostException _
0))))
(defn check-urls-from-file [filename]
(doseq [line (map check-url
(read-lines (as-file filename)))]
(println line)))
Given your example as input:
user> (check-urls-from-file "urls.txt")
www.google.com:1
www.cnn.com:1
www.msnbc.com:1
www.abadurlisnotgood.com:0
Upvotes: 6
Reputation: 91607
Write a small function that appends a ":1" or ":0" to a url and then use pmap to apply it in parallel to all the urls.
(defn check-a-url [url] .... )
(pmap #(if (check-a-url %) (str url ":1") (str url ":0")))
Upvotes: 3