bloudermilk

Reputation: 18129

What is the best way to parse all the URLs out of a string of HTML?

I'm writing a web crawler in Node for fun over the next couple weeks. In my prototype, I was using jsdom to jquerify the page, then searching for all the anchors and adding the hrefs to my crawl list. I realized that I could potentially find a lot more URLs if I just parsed any URL out of the source (URLs in text, for example). I'm wondering if there's any good javascript libraries out there to do this using regex or otherwise.
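For the "any URL in the source" idea, a minimal regex-based sketch (no library, sample text and pattern are illustrative, not a hardened crawler rule) might look like:

```javascript
// Pull every http(s) URL out of a string, whether it appears in an
// attribute or in plain text. The sample text is made up.
var text = 'Visit <a href="http://example.com/a">here</a> or see https://example.org/b?q=1 in the text.';

// Match http/https followed by a run of characters that can't end a URL
// in HTML or prose: whitespace, quotes, or angle brackets.
var urlPattern = /https?:\/\/[^\s"'<>]+/g;
var urls = text.match(urlPattern) || [];

console.log(urls);
```

A pattern this loose will also pick up trailing punctuation (e.g. a period at the end of a sentence), so a real crawler would want to trim or validate each match.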

As a side note: Is this a bad idea?

Update:

Although I originally selected Chris's answer below, I was a bit trigger-happy, as it turns out. Unfortunately I didn't end up using node.io. I found it to be a little bloated, and it doesn't really focus on what I was attempting to do. At the moment I'm using soupselect + htmlparser to grab the href values of any anchors on the page, and I'm happy with this solution for the time being.
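The exact soupselect + htmlparser calls aren't shown above; as a rough, library-free stand-in for that approach, this sketch pulls the href values out of anchor tags with a regex (a real HTML parser handles malformed markup far more robustly):

```javascript
// Hypothetical page source; in the crawler this would be the fetched HTML.
var html = '<p><a href="/about">About</a> and <a href=\'http://example.com/\'>home</a></p>';

var hrefs = [];
// Match an <a> tag and capture the value of its href attribute.
var anchorPattern = /<a\b[^>]*\bhref=["']([^"']*)["']/gi;
var m;
while ((m = anchorPattern.exec(html)) !== null) {
  hrefs.push(m[1]); // capture group 1 is the href value
}

console.log(hrefs);
```

Each captured href would then be resolved against the page's base URL before being added to the crawl list.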

Upvotes: 1

Views: 235

Answers (2)

Chris Fulstow

Reputation: 41902

Check out node.io; it's an excellent scraping and processing framework for Node.js.

Alternatively, it's also possible to use YUI3 to parse and manipulate an HTML document from Node.

Upvotes: 2

Niet the Dark Absol

Reputation: 324790

When looking for URLs, I use this regex: /(https?:\/\/)([^.\/]+(?:\.[^.\/]+)+)(\/.*)/

You then have sub-patterns:

  1. Protocol
  2. Domain
  3. Path

Not sure how well it'd work for a crawler, but it's never failed me yet.
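To illustrate, applying that pattern to a made-up URL splits it into the three capture groups:

```javascript
// The regex from the answer: captures protocol, domain, and path.
var urlPattern = /(https?:\/\/)([^.\/]+(?:\.[^.\/]+)+)(\/.*)/;
var m = 'http://www.example.com/path/page.html'.match(urlPattern);

// m[1] = protocol, m[2] = domain, m[3] = path
console.log(m[1], m[2], m[3]);
```

One caveat for a crawler: the third group `(\/.*)` requires a slash, so a bare URL like `http://example.com` with no path won't match at all.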

Upvotes: 1
