Reputation: 22307
I am using the following regex
html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s))
to match the name [Burkhart, Peterson & Company] in this
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>
Upvotes: 0
Views: 1013
Reputation: 53
You can verify that all the answers suggesting you add /m or Regexp::MULTILINE are correct by going to rubular.com.
I also verified the solution in the console, and modified the regex so that it returns only the name instead of all the extra junk.
Loading development environment (Rails 2.3.8)
ree-1.8.7-2010.02 > html = '<td class="generalinfo_left" align="right">Name:</td>
ree-1.8.7-2010.02'> <td class="generalinfo_right">Burkhart, Peterson & Company</td>
ree-1.8.7-2010.02'> '
=> "<td class="generalinfo_left" align="right">Name:</td>\n<td class="generalinfo_right">Burkhart, Peterson & Company</td>\n"
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/m))
=> [["\n<td class="generalinfo_right">Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>.*<td[^>]*>(.*?)<\/td>/m))
=> [["Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 >
Upvotes: 0
Reputation: 16241
Generally, parsing (X)HTML with regular expressions is bad practice. Ruby has the fantastic Nokogiri library, which uses libxml2 to parse (X)HTML efficiently.
That being said, the dot (.) in your pattern does not match newlines. Use the m modifier on your regexp, which tells the dot to match newlines, or pass the Regexp::MULTILINE constant to Regexp.new. Both are described in the Ruby Regexp class documentation.
Your regular expression is also capturing the HTML before the text you require.
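A minimal sketch of both points, assuming the two-line snippet from the question is held in a variable named html (the other variable names are only illustrative):

```ruby
# Sample markup from the question; the two cells are separated by a newline.
html = "<td class=\"generalinfo_left\" align=\"right\">Name:</td>\n" \
       "<td class=\"generalinfo_right\">Burkhart, Peterson & Company</td>\n"

# Without /m the dot stops at the newline, so nothing matches.
without_m = html.scan(/Name:<\/td>(.*?)<\/td>/)

# With /m the dot also matches "\n", so the pattern can span both lines.
with_m = html.scan(/Name:<\/td>(.*?)<\/td>/m)

# Regexp::MULTILINE is the same option passed to Regexp.new explicitly.
with_constant = html.scan(Regexp.new('Name:</td>(.*?)</td>', Regexp::MULTILINE))

p without_m                # no matches
p with_m                   # one match, but it still contains the opening <td ...> tag
p with_m == with_constant  # both spellings behave the same
```

The second result shows why the capture group needs tightening: the lazy (.*?) starts right after the first closing tag, so it picks up the newline and the opening tag of the next cell as well.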
Using Nokogiri and XPath, you could grab the content of this table cell by referring to its CSS class, like this:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri::HTML DATA.read
p doc.at("td[@class='generalinfo_right']").text
__END__
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>
This prints "Burkhart, Peterson & Company".
Upvotes: 4
Reputation: 29669
html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s))
doesn't match the newline characters; even if it did match them, the (.*?) part would grab everything after the first </td>, including <td class="generalinfo_right">.
To make the regular expression more generic, and to have it capture exactly the text you want, you should change the code to
html.scan(Regexp.new(/Name:<\/td>\s*<td[^>]*>(.*?)<\/td>/m))
The regular expression could be better written, though.
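One possible tidier form, offered only as a sketch: String#[] with a capture-group index returns the captured string directly (or nil when nothing matches), and a negated character class keeps the capture from running past the next tag. The variable names are illustrative, and the sample markup is the snippet from the question.

```ruby
# Sample markup from the question.
html = "<td class=\"generalinfo_left\" align=\"right\">Name:</td>\n" \
       "<td class=\"generalinfo_right\">Burkhart, Peterson & Company</td>\n"

# \s* skips the whitespace (including the newline) between the two cells.
# [^<]* matches any run of characters that are not "<", so it cannot
# swallow a following tag and needs no /m flag.
name = html[/Name:<\/td>\s*<td[^>]*>([^<]*)<\/td>/, 1]

p name
```

This assumes the cell contains plain text only; nested tags inside the cell would make the pattern fail to match, which is one more reason a real parser is the safer choice.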
I would also not suggest to parse HTML/XHTML content with regular expression.
Upvotes: 0
Reputation: 504
You'll want to use /m for multiline mode:
str.scan(/Name:<\/td>(.*?)<\/td>/m)
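Forward slashes inside a /.../ literal have to be written as \/. As an alternative sketch, a %r{...} literal uses braces as delimiters, so the closing tags can be written unescaped. The sample string and the names str, pattern and names are assumptions made for illustration.

```ruby
# Simplified two-cell sample, separated by a newline as in the question.
str = "<td>Name:</td>\n<td>Burkhart, Peterson & Company</td>\n"

# %r{...} is another regexp literal syntax; "/" needs no backslash here.
# The trailing m makes the dot match newlines as well.
pattern = %r{Name:</td>\s*<td[^>]*>(.*?)</td>}m

# scan returns one array per match; flatten gives a flat list of names.
names = str.scan(pattern).flatten

p names
```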
Upvotes: 0