Mateus Pinheiro

Reputation: 870

Problem with Ruby Regular Expression

I have this HTML code, which is all on a single line:

<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>

Here is a more readable version, split across lines (which I can't use):

<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>
<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>

I'm trying to extract just the URLs with this regex:

/<h3 class='r'><a href="(.*)">(.*)<\/a>/

But it returns:

www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com"

How can I make the match stop at the first " it finds?

Upvotes: 1

Views: 509

Answers (2)

the Tin Man

Reputation: 160553

Sigh. Regex and HTML are such awkward bedfellows:

require 'nokogiri'

html = %q{<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>}
doc = Nokogiri::HTML(html)
puts doc.css('a').map{ |a| a['href'] }
# >> www.google.com
# >> www.google.com

This will find the links whether they are deeply nested, split across lines, or all on one line.

Upvotes: 3

Kyle Wild

Reputation: 8915

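The problem is that * is greedy: it matches as much as it can, so .* runs all the way to the last " on the line. Put a question mark after it to make it non-greedy, so it stops at the first " instead.

Working regex (tested on Rubular):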

href\=\"(.*?)\"
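To see the difference on the question's input, here is a minimal sketch using String#scan. The variable names are just illustrative, and the [^"]* variant is an alternative I'm adding, not part of the regex above; it assumes the href values never contain a double quote.

```ruby
# The single-line HTML from the question.
html = %q{<h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3><h3 class='r'><a href="www.google.com">fkdsafjldsajl</a></h3>}

# Greedy: .* runs to the last double quote on the line,
# so scan returns a single oversized capture.
greedy = html.scan(/href="(.*)"/).flatten

# Non-greedy: .*? stops at the first double quote after each href.
lazy = html.scan(/href="(.*?)"/).flatten

# Alternative: a negated character class cannot cross a double quote at all.
negated = html.scan(/href="([^"]*)"/).flatten

puts lazy
```

Both the non-greedy and the negated-class versions return one URL per link, while the greedy version returns one long string spanning both tags.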

Upvotes: 3
