signus

Reputation: 1148

Grabbing and/or Parsing HTML Traffic in Ruby with Watir?

While parsing pages and automating browsers with Watir and Mechanize, I've come across a section of data that I want to extract from a page (and similar data from other pages). It looks like this:

<data>
<somehtmltags>
<tr style="cursor:auto"><td class="hyperlink-first" style="padding-top:20px">Title1</td><td style="padding-top:20px">Data: <br/>Data2: <br/>Data3: <br/>Data4: <br/></td><td style="text-align:center;"><img alt="SomeData" border="0" height="100" src="servlet/Chart?filename=jfreechart-onetime-tmp.png" style="position:static" width="580"/></td></tr>
<tr style="cursor:auto"><td class="hyperlink-first" style="padding-top:20px">Title2</td><td style="padding-top:20px">Data: <br/>Data2: <br/>Data3: <br/>Data4: <br/></td><td style="text-align:center;"><img alt="SomeData" border="0" height="100" src="servlet/Chart?filename=jfreechart-onetime-tmp.png" style="position:static" width="580"/></td></tr>
<somemorehtmltags>
<more data>

My question is: using Watir, Mechanize, Nokogiri, or similar tools in Ruby, is there a simpler way to select a particular set of matching tags in my HTML and save it elsewhere?

So in this example, I'd like to search for the set of tags with the title "Title1" and save that section of code (including the tags) to a string.

Upvotes: 0

Views: 553

Answers (2)

Justin Ko

Reputation: 46836

My interpretation of your question is that you want the HTML of the cell (td element) adjacent to the cell (td element) with the text "Title1". In your example code, this would be the second td element in the first tr element.

Assuming the interpretation is correct, you can do the following. Note that you can use the .html method on any Watir element to get its html (as a string which you can save to a variable).

#Find the cell with Title1 and then get the second cell in that row
html = browser.td(:text => 'Title1').parent.td(:index => 1).html
#=> "<td style=\"padding-top:20px\">Data: <br>Data2: <br>Data3: <br>Data4: <br></td>"

If you want the entire row, including the title, you can get the parent of the Title1 element:

html = browser.td(:text => 'Title1').parent.html
#=> "<tr style=\"cursor:auto\"><td class=\"hyperlink-first\" style=\"padding-top:20px\">Title1</td><td style=\"padding-top:20px\">Data: <br>Data2: <br>Data3: <br>Data4: <br></td><td style=\"text-align:center;\"><img alt=\"SomeData\" src=\"servlet/Chart?filename=jfreechart-onetime-tmp.png\" style=\"position:static\" border=\"0\" height=\"100\" width=\"580\"></td></tr>"

The above assumes that only one Title1 element on the page is of interest. If there could be several, create a collection of the td elements that have the text Title1 and then collect the adjacent cell for each. This will give you an array of strings.

html = browser.tds(:text => 'Title1').collect do |td| 
    td.parent.td(:index => 1).html
end
#=> ["<td style=\"padding-top:20px\">Data: <br>Data2: <br>Data3: <br>Data4: <br></td>", 
#=> "<td style=\"padding-top:20px\">Data: <br>Data2: <br>Data3: <br>Data4: <br></td>"]

If Jano's interpretation is correct and you want all of the rows where the title is like "Title" (i.e., "Title1", "Title2", etc.), you can use a regex to do partial text matching. The following would give you each row where the first cell matches "Title" followed by a single digit.

html = browser.tds(:text => /^Title\d$/).collect do |td| 
    td.parent.html
end
#=> ["<tr style=\"cursor:auto\"><td class=\"hyperlink-first\" style=\"padding-top:20px\">Title1</td><td style=\"padding-top:20px\">Data: <br>Data2: <br>Data3: <br>Data4: <br></td><td style=\"text-align:center;\"><img alt=\"SomeData\" src=\"servlet/Chart?filename=jfreechart-onetime-tmp.png\" style=\"position:static\" border=\"0\" height=\"100\" width=\"580\"></td></tr>",
#=> "<tr style=\"cursor:auto\"><td class=\"hyperlink-first\" style=\"padding-top:20px\">Title2</td><td style=\"padding-top:20px\">Data: <br>Data2: <br>Data3: <br>Data4: <br></td><td style=\"text-align:center;\"><img alt=\"SomeData\" src=\"servlet/Chart?filename=jfreechart-onetime-tmp.png\" style=\"position:static\" border=\"0\" height=\"100\" width=\"580\"></td></tr>"]

Upvotes: 3

Pablo Gómez

Reputation: 631

With Ruby and Watir, you can use regular expressions to search the page's HTML for your tags. In your case, you can get the page's HTML with something like:

my_html_container = @browser.html

...and then use a regexp with the scan method to extract the tags, for example:

my_tags = my_html_container.scan(/(<tr .*)Title\d(.*tr>)/)

You can modify the regexp to get exactly what you want :)
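
One detail worth knowing when adapting a pattern like this: String#scan returns arrays of the captured pieces when the pattern contains capture groups, and returns the full matched strings when it does not. The sketch below illustrates the difference on a small sample. The html string is a hypothetical stand-in for @browser.html, and it assumes each tr sits on its own line (by default, "." does not match newlines in Ruby), so treat it as a starting point rather than a general HTML parser.

```ruby
# Sample markup standing in for @browser.html (assumption: one <tr> per line).
html = <<~HTML
  <table>
  <tr style="cursor:auto"><td class="hyperlink-first">Title1</td><td>Data: <br/>Data2: <br/></td></tr>
  <tr style="cursor:auto"><td class="hyperlink-first">Title2</td><td>Data: <br/>Data2: <br/></td></tr>
  <tr><td>Other</td><td>ignored</td></tr>
  </table>
HTML

# With capture groups, scan yields [before, after] pairs;
# the "TitleN" text between the two groups is not part of either capture.
pieces = html.scan(/(<tr .*)Title\d(.*tr>)/)

# Without capture groups, scan yields each full match:
# the whole row as one string, tags and title included.
rows = html.scan(/<tr\b[^>]*>.*?Title\d.*?<\/tr>/)

# Narrow down to a single title by checking the cell text.
title1_row = rows.find { |row| row.include?('>Title1<') }

puts title1_row
```

Here rows holds one string per matching row, which can be saved or written out as-is. For markup that spans several lines or nests tables, an HTML parser or the Watir element methods are likely to be more robust than a regex.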

Upvotes: 0
