Shubham

Reputation: 22307

RegEx Not working in Ruby!

I am using the following regex

html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s))

to match the name [ Burkhart, Peterson &amp; Company ] in this HTML:

<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson &amp; Company</td>

Upvotes: 0

Views: 1013

Answers (5)

tmorse

Reputation: 53

You can verify that all the answers suggesting you add /m or Regexp::MULTILINE are correct by going to rubular.com.

I also verified the solution in the console, and modified the regex so that it returns only the name instead of all the extra junk.

Loading development environment (Rails 2.3.8)
ree-1.8.7-2010.02 > html = '<td class="generalinfo_left" align="right">Name:</td>
ree-1.8.7-2010.02'> <td class="generalinfo_right">Burkhart, Peterson &amp; Company</td>
ree-1.8.7-2010.02'> '
 => "<td class=\"generalinfo_left\" align=\"right\">Name:</td>\n<td class=\"generalinfo_right\">Burkhart, Peterson &amp; Company</td>\n"
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/m))
 => [["\n<td class=\"generalinfo_right\">Burkhart, Peterson &amp; Company"]]
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>.*<td[^>]*>(.*?)<\/td>/m))
 => [["Burkhart, Peterson &amp; Company"]] 
ree-1.8.7-2010.02 > 
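
One caveat about the second pattern above: the greedy .* combined with /m can run past the cell you want when the page contains more rows. A minimal sketch of that, assuming a hypothetical second row (the City row below is invented for illustration and is not in the question), and using \s* in place of .* so that only whitespace may sit between the two cells:

```ruby
# Two table rows; the second row is hypothetical sample data.
html = <<~HTML
  <tr>
    <td class="generalinfo_left" align="right">Name:</td>
    <td class="generalinfo_right">Burkhart, Peterson &amp; Company</td>
  </tr>
  <tr>
    <td class="generalinfo_left" align="right">City:</td>
    <td class="generalinfo_right">Springfield</td>
  </tr>
HTML

# Greedy .* with /m runs to the end of the string and backtracks
# to the LAST <td>, so the wrong cell is captured.
greedy = html.scan(/Name:<\/td>.*<td[^>]*>(.*?)<\/td>/m)

# \s* only allows whitespace (including newlines) between the cells,
# so the capture stays on the cell right after the label.
tight = html.scan(/Name:<\/td>\s*<td[^>]*>(.*?)<\/td>/m)

p greedy
p tight
```

With a single row, as in the question, both patterns give the same result; the difference only shows up once more cells follow.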

Upvotes: 0

Lee Jarvis

Reputation: 16241

Generally, parsing (X)HTML with regular expressions is bad practice. Ruby has the fantastic Nokogiri library, which uses libxml2 to parse XHTML efficiently.

With that being said, your . does not match newlines. Use the m modifier on your regexp, which tells the . to match newlines, or pass the Regexp::MULTILINE constant. Both are covered in the Ruby Regexp documentation.

Your regular expression is also capturing the HTML before the text you require.

Using Nokogiri and XPath, you could grab the content of this table cell by referring to its CSS class, like this:

#!/usr/bin/env ruby

require 'nokogiri'

doc = Nokogiri::HTML DATA.read

p doc.at("td[@class='generalinfo_right']").text

__END__
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson &amp; Company</td>

This prints "Burkhart, Peterson & Company".

Upvotes: 4

avpaderno

Reputation: 29669

html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s)) doesn't match newline characters; even if it did, the (.*?) part would capture everything after the first </td>, including <td class="generalinfo_right">.

To make the regular expression more generic and have it capture exactly the text you want, change the code to

html.scan(Regexp.new(/Name:<\/td>\s*<td[^>]*>(.*?)<\/td>/m))

The regular expression could be better written, though.
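
As one possible way to write it more readably, here is a sketch using Ruby's extended mode (the x flag), which ignores whitespace inside the pattern and allows comments. The constant name NAME_CELL is made up for this example, and the \s* between the two cells is an assumption about how the real page is laid out:

```ruby
html = <<~HTML
  <td class="generalinfo_left" align="right">Name:</td>
  <td class="generalinfo_right">Burkhart, Peterson &amp; Company</td>
HTML

# NAME_CELL is a hypothetical name. %r{...} avoids escaping the slashes,
# x allows the layout and comments, m lets . span newlines inside the cell.
NAME_CELL = %r{
  Name:</td>     # label text and the closing tag of the label cell
  \s*            # whitespace or newlines between the two cells
  <td[^>]*>      # opening tag of the value cell, with any attributes
  (.*?)          # capture the cell contents, as little as possible
  </td>          # closing tag of the value cell
}mx

# String#[] with a regexp and a group index returns just that capture.
name = html[NAME_CELL, 1]
p name
```

Note that the captured text still contains the raw &amp; entity; a regular expression will not decode it for you.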

I would also not recommend parsing HTML/XHTML content with regular expressions.

Upvotes: 0

commondream

Reputation: 504

You'll want to use /m for multiline mode:

str.scan(/Name:<\/td>(.*?)<\/td>/m)

Upvotes: 0

Wrikken

Reputation: 70460

/m makes the dot match newlines
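
A small sketch of the difference, using the two lines of HTML from the question. In Ruby the s flag selects a character encoding rather than changing what the dot matches, which is presumably why the original attempt found nothing:

```ruby
html = <<~HTML
  <td class="generalinfo_left" align="right">Name:</td>
  <td class="generalinfo_right">Burkhart, Peterson &amp; Company</td>
HTML

# Without /m the dot stops at the newline after the first </td>,
# so the pattern cannot reach the second </td> and nothing matches.
without_m = html.scan(/Name:<\/td>(.*?)<\/td>/)

# With /m the dot also matches "\n", so the match spans both lines.
with_m = html.scan(/Name:<\/td>(.*?)<\/td>/m)

# Regexp::MULTILINE is the same option as the /m flag, for patterns built from strings.
with_constant = html.scan(Regexp.new('Name:</td>(.*?)</td>', Regexp::MULTILINE))

p without_m
p with_m
p with_constant
```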

Upvotes: 2
