bloudermilk

Reputation: 18129

What is the best way to parse all the URLs out of a string of HTML?

I'm writing a web crawler in Node for fun over the next couple weeks. In my prototype, I was using jsdom to jquerify the page, then searching for all the anchors and adding the hrefs to my crawl list. I realized that I could potentially find a lot more URLs if I just parsed any URL out of the source (URLs in text, for example). I'm wondering if there's any good javascript libraries out there to do this using regex or otherwise.
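For the "any URL in the source" idea, a minimal regex-based sketch (no library, sample text and pattern are illustrative, not a hardened crawler rule) might look like:

```javascript
// Pull every http(s) URL out of a string, whether it appears in an
// attribute or in plain text. The sample text is made up.
var text = 'Visit <a href="http://example.com/a">here</a> or see https://example.org/b?q=1 in the text.';

// Match http/https followed by a run of characters that can't end a URL
// in HTML or prose: whitespace, quotes, or angle brackets.
var urlPattern = /https?:\/\/[^\s"'<>]+/g;
var urls = text.match(urlPattern) || [];

console.log(urls);
```

A pattern this loose will also pick up trailing punctuation (e.g. a period at the end of a sentence), so a real crawler would want to trim or validate each match.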

As a side note: Is this a bad idea?

Update:

Although I originally selected Chris's answer below, I was a bit trigger-happy, as it turns out. Unfortunately I didn't end up using node.io. I found it to be a little bloated, and it doesn't really focus on what I was attempting to do. At the moment I'm using soupselect + htmlparser to grab the href values of any anchors on the page, and I'm happy with this solution for the time being.
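The exact soupselect + htmlparser calls aren't shown above; as a rough, library-free stand-in for that approach, this sketch pulls the href values out of anchor tags with a regex (a real HTML parser handles malformed markup far more robustly):

```javascript
// Hypothetical page source; in the crawler this would be the fetched HTML.
var html = '<p><a href="/about">About</a> and <a href=\'http://example.com/\'>home</a></p>';

var hrefs = [];
// Match an <a> tag and capture the value of its href attribute.
var anchorPattern = /<a\b[^>]*\bhref=["']([^"']*)["']/gi;
var m;
while ((m = anchorPattern.exec(html)) !== null) {
  hrefs.push(m[1]); // capture group 1 is the href value
}

console.log(hrefs);
```

Each captured href would then be resolved against the page's base URL before being added to the crawl list.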

Upvotes: 1

Views: 235

Answers (2)

Chris Fulstow

Reputation: 41902

Check out node.io; it's an excellent scraping and processing framework for Node.js.

Alternatively, it's also possible to use YUI3 to parse and manipulate an HTML document from Node.

Upvotes: 2

Niet the Dark Absol

Reputation: 324790

When looking for URLs, I use this regex: /(https?:\/\/)([^.\/]+(?:\.[^.\/]+)+)(\/.*)/

You then have sub-patterns:

  1. Protocol
  2. Domain
  3. Path

Not sure how well it'd work for a crawler, but it's never failed me yet.
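To illustrate, applying that pattern to a made-up URL splits it into the three capture groups:

```javascript
// The regex from the answer: captures protocol, domain, and path.
var urlPattern = /(https?:\/\/)([^.\/]+(?:\.[^.\/]+)+)(\/.*)/;
var m = 'http://www.example.com/path/page.html'.match(urlPattern);

// m[1] = protocol, m[2] = domain, m[3] = path
console.log(m[1], m[2], m[3]);
```

One caveat for a crawler: the third group `(\/.*)` requires a slash, so a bare URL like `http://example.com` with no path won't match at all.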

Upvotes: 1
