SooDesuNe

Reputation: 10030

Regex to verify content between HTML tags

I'm struggling to come up with a RegEx that will confirm that some text exists between two tags. Specifically, I want to ensure that the text "TOTAL" and "$19.00" can both be found within the same table row.

I'm not planning to nest tables, so I'm not worried about a nested match, but I do want to make sure that my text is within the SAME tr.

My HTML:

<tr style='text-align:right;'>
  <td>shipping:</td>
  <td style='padding-left:3em;'>$17.00</td>
</tr>
<tr style='text-align:right;'>
  <td>TOTAL:</td>
  <td style='padding-left:3em;'>$19.00</td>
</tr>

Regular Expression I tried:

/<tr\b[^>]*>(.*?)<\/tr>/m

It's close: the second match contains my content. What do I need to change so that only that second row is matched?

You can play with it on Rubular here
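To make the behaviour of the attempted pattern concrete, here is a minimal Ruby sketch. The variable names (`html`, `row_pattern`, `rows`, `total_rows`) are illustrative and not part of the original post. With `scan`, the pattern produces one capture per `<tr>`, so both rows come back and the row of interest still has to be picked out afterwards, here with a plain substring check.

```ruby
html = <<~HTML
  <tr style='text-align:right;'>
    <td>shipping:</td>
    <td style='padding-left:3em;'>$17.00</td>
  </tr>
  <tr style='text-align:right;'>
    <td>TOTAL:</td>
    <td style='padding-left:3em;'>$19.00</td>
  </tr>
HTML

# The pattern from the question; in Ruby, /m lets "." match newlines too.
row_pattern = /<tr\b[^>]*>(.*?)<\/tr>/m

# scan returns one array of captures per match, i.e. one entry per row.
rows = html.scan(row_pattern).map(&:first)

# Keep only the rows whose inner markup contains both pieces of text.
total_rows = rows.select { |row| row.include?('TOTAL') && row.include?('$19.00') }

puts rows.length        # => 2
puts total_rows.length  # => 1
```

This only splits the markup into rows and filters them; it assumes tables are not nested, as stated in the question.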

Upvotes: 1

Views: 985

Answers (2)

mu is too short

Reputation: 434665

I think an HTML parser and a bit of XPath would be a better call than a regex. Something like this:

require 'nokogiri'

# Each XPath finds the label cell, then any following sibling cell with non-blank text.
shipping = '//td[normalize-space(text())="shipping:"]/following-sibling::td[normalize-space(text())]'
total    = '//td[normalize-space(text())="TOTAL:"]/following-sibling::td[normalize-space(text())]'
doc = Nokogiri::HTML <<HTML
  <tr style='text-align:right;'>
    <td>  shipping:    </td>
    <td style='padding-left:3em;'>$17.00</td>
  </tr>
  <tr style='text-align:right;'>
    <td>TOTAL:</td>
    <td style='padding-left:3em;'>$19.00</td>
  </tr>
HTML
# Exactly one non-blank value cell should sit next to each label.
has_shipping = doc.xpath(shipping).count == 1 # true
has_total    = doc.xpath(total   ).count == 1 # true

But without the $17.00 and $19.00:

doc = Nokogiri::HTML <<HTML
  <tr style='text-align:right;'>
    <td>  shipping:    </td>
    <td style='padding-left:3em;'>    </td>
  </tr>
  <tr style='text-align:right;'>
    <td>TOTAL:</td>
    <td style='padding-left:3em;'></td>
  </tr>
HTML
has_shipping = doc.xpath(shipping).count == 1 # false
has_total    = doc.xpath(total   ).count == 1 # false

If you want to verify the format of the price as well then you can find just the <td>s you want and apply whatever Enumerable methods make sense in your situation:

shipping = '//td[normalize-space(text())="shipping:"]/following-sibling::td'
good_one = doc.xpath(shipping).count { |n| n.content =~ /\A\s*\$\d+\.\d{2}\s*\z/ } == 1

Upvotes: 2

renato

Reputation: 125

<tr.*?>\s*?<td.*?>TOTAL:<\/td>\s*?<td.*?>\$19\.00<\/td>\s*?<\/tr>
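A brief illustration of how this pattern might be used from Ruby. The constant `TOTAL_ROW` and the helper `sample_rows` are hypothetical names introduced for the example, and the markup mirrors the question's layout with one tag per line. The pattern is deliberately written without the `/m` flag: `.` then cannot cross a line break, so `<td.*?>` cannot run on from one row into the next, while `\s*?` still absorbs the newlines between tags.

```ruby
# The pattern from this answer, with no /m flag so "." stops at line breaks.
TOTAL_ROW = /<tr.*?>\s*?<td.*?>TOTAL:<\/td>\s*?<td.*?>\$19\.00<\/td>\s*?<\/tr>/

# Hypothetical helper: builds the two-row markup with the given amounts.
def sample_rows(shipping_amount, total_amount)
  <<~HTML
    <tr style='text-align:right;'>
      <td>shipping:</td>
      <td style='padding-left:3em;'>#{shipping_amount}</td>
    </tr>
    <tr style='text-align:right;'>
      <td>TOTAL:</td>
      <td style='padding-left:3em;'>#{total_amount}</td>
    </tr>
  HTML
end

# "TOTAL:" and "$19.00" in the same row: the pattern matches.
puts TOTAL_ROW.match?(sample_rows('$17.00', '$19.00'))  # => true

# Both strings present, but in different rows: no match.
puts TOTAL_ROW.match?(sample_rows('$19.00', '$17.00'))  # => false
```

This relies on the cells appearing in exactly this order and on each tag sitting on its own line. If `/m` were added, `.` could span line breaks and the same-row guarantee would no longer hold.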

Upvotes: 2
