Reputation: 305

Regex in Ruby for a URL that is an image

So I'm working on a crawler to get a bunch of images on a page that are saved as links. The relevant code, at the moment, is:

def parse_html(html)
  html_doc = Nokogiri::HTML(html)
  nodes = html_doc.xpath("//a[@href]")
  nodes.inject([]) do |uris, node|
    uris << node.attr('href').strip
  end.uniq
end

I am currently getting a bunch of links, most of which are images, but not all. I want to narrow down the links with a regex before downloading. So far, I haven't been able to come up with a Ruby-friendly regex for the job. The best I have is:

^https?:\/\/(?:[a-z0-9\-]+\.)+[a-z]{2,6}(?:/[^\/?]+)+\.(?:jpg|gif|png)$.match(nodes)

Admittedly, I got that regex from someone else and tried to edit it to work, but I'm failing. One of the big problems is that the original regex had a few "#" characters in it, and I don't know whether that is a character I can escape or whether Ruby will just stop reading at that point. Any help is much appreciated.

Upvotes: 0

Answers (3)

jacobherrington

Reputation: 463

As some have said, you may not want to use Regex for this, but if you're determined to:

^http(s?):\/\/.*\.(jpeg|jpg|gif|png)$

This is a pretty simple one that will match anything beginning with http or https and ending with one of the file extensions listed. You should be able to figure out how to extend it; Rubular.com is good for experimenting with these.
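
For what it's worth, here is a rough sketch of how a pattern like that might be applied in Ruby. The `%r{...}` literal avoids having to escape the slashes, `\A` and `\z` anchor the whole string, and the `i` flag is my own addition so that upper-case extensions are accepted. The constant name `IMAGE_URL` and the sample links are made up for illustration.

```ruby
# Hypothetical constant name; the pattern is anchored at both ends so the
# extension really has to be the last thing in the string.
IMAGE_URL = %r{\Ahttps?://.*\.(?:jpe?g|gif|png)\z}i

# Sample data, invented for this example.
links = [
  "http://example.com/a.jpg",
  "https://example.com/img/b.PNG",
  "http://example.com/page.html",
  "ftp://example.com/c.gif",
  "http://example.com/d.png?size=large"
]

# Array#grep keeps the elements for which IMAGE_URL === element is true.
image_links = links.grep(IMAGE_URL)
# => ["http://example.com/a.jpg", "https://example.com/img/b.PNG"]

# A bare "#" is an ordinary character inside a regex literal.
has_fragment = "http://example.com/page#top".match?(/#top\z/)
# => true
```

On the "#" question: as far as I know, a plain `#` only becomes special in a regex literal when it starts an interpolation such as `#{...}`, or when the `x` (extended) flag is on, where it begins a comment. In those cases writing `\#` should be enough.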

Upvotes: 1

Mark Thomas

Reputation: 37527

I would consider modifying your XPath to include your logic. For example, if you only want the a elements that contain an img, you can use the following:

"//a[img][@href]"

Or even go further and extract just the URIs directly from the href values:

uris = html_doc.xpath("//a[img]/@href").map(&:value)

Upvotes: 1

spickermann

Reputation: 107142

Regular expressions are a very powerful tool but, compared to simple string comparisons, they are pretty slow.

For your simple example, I would suggest using a simple condition like:

IMAGE_EXTS = %w[gif jpg png]
if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  # ...
end

In the context of your question, you might want to change your method to:

IMAGE_EXTS = %w[gif jpg png]

def parse_html(html)
  uris = []

  Nokogiri::HTML(html).xpath("//a[@href]").each do |node|
    uri = node.attr('href').strip
    uris << uri if IMAGE_EXTS.any? { |ext| uri.end_with?(ext) }
  end

  uris.uniq
end
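
One caveat with a plain `end_with?` check is that it misses links with a query string (`photo.png?size=large`) and accepts paths that merely end in the letters `png`. A possible variation, sketched below, parses the URL with the standard `uri` library and compares the extension of the path only. The helper name `image_uri?` and the constant `IMAGE_EXTENSIONS` are hypothetical, and treating unparseable links as non-images is an assumption on my part.

```ruby
require "uri"

# Hypothetical list; extend it with whatever formats you need.
IMAGE_EXTENSIONS = %w[.gif .jpg .jpeg .png].freeze

# Hypothetical helper: true when the path part of the URL (ignoring any
# query string or fragment) ends in one of the extensions above.
def image_uri?(uri)
  path = URI.parse(uri).path.to_s
  IMAGE_EXTENSIONS.include?(File.extname(path).downcase)
rescue URI::InvalidURIError
  # Assumption: links that cannot be parsed are simply skipped.
  false
end
```

Inside `parse_html` this could replace the `IMAGE_EXTS.any?` condition, for example `uris << uri if image_uri?(uri)`.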

Upvotes: 0
