Reputation: 1

Parsing prices from HTML returns blank or no value?

I'm not having any success with this: no matter what I do, it returns only blank values.

Here is my code:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

PAGE_URL = "http://www.oficinadosbits.com.br/produto18064/EVGA_GeForce_GTX_980_04G-P4-2983-KR.html"

page = Nokogiri::HTML(open(PAGE_URL))

price = page.xpath("/html/body/div[1]/div/div/table[1]/tbody/tr[1]/td[3]/table[1]/tbody/tr[2]/td[3]/table[1]/tbody/tr/td[1]/font[2]/table/tbody/tr[1]/td/span").text

puts price

I tried using CSS and also Mechanize, but without success:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'mechanize'

mechanize = Mechanize.new

page = mechanize.get("http://www.oficinadosbits.com.br/produto18064/EVGA_GeForce_GTX_980_04G-P4-2983-KR.html")

doc = page.parser

price = doc.xpath("/html/body/div[1]/div/div/table[1]/tbody/tr[1]/td[3]/table[1]/tbody/tr[2]/td[3]/table[1]/tbody/tr/td[1]/font[2]/table/tbody/tr[1]/td/span").text

puts price

When I use:

puts price.size

at the end, it returns zero. Why does it find nothing?

I'd like to understand why this is happening, and how I could solve it in order to be able to parse the prices.

I got the xpath from Firebug's "copy xpath" option.

Upvotes: 0

Answers (4)

the Tin Man

Reputation: 160581

Nokogiri supports both XPath and CSS selectors. I generally go with CSS for readability and simplicity, but XPath is also worth knowing because it is very powerful.

Consider this code:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p id="p1" class="paragraphs" foo="bar">some text</p>
    <p id="p2" class="paragraphs" foo="baz">some text</p>
  </body>
</html>
EOT

We can find nodes by their tag:

p_nodes = doc.search('p')
p_nodes.class             # => Nokogiri::XML::NodeSet
p_nodes.size              # => 2
p_nodes.map(&:to_html)    # => ["<p id=\"p1\" class=\"paragraphs\" foo=\"bar\">some text</p>", "<p id=\"p2\" class=\"paragraphs\" foo=\"baz\">some text</p>"]

Using search returns a NodeSet, which is akin to an Array. In this example it found two <p> tags.

Compare that to using at:

p_node = doc.at('p')
p_node.class                   # => Nokogiri::XML::Element
doc.at('p#p2').to_html         # => "<p id=\"p2\" class=\"paragraphs\" foo=\"baz\">some text</p>"
doc.at('p.paragraphs').to_html # => "<p id=\"p1\" class=\"paragraphs\" foo=\"bar\">some text</p>"

at is equivalent to taking the first element found by search, but it returns a single Element (a Node) rather than a NodeSet. A node can contain other nodes, and a NodeSet is an array-like collection of nodes. In every case, a node acts like a pointer into the document, which is useful for navigating around:

doc.at('body').at('p') # => #<Nokogiri::XML::Element:0x3fd431448c6c name="p" attributes=[#<Nokogiri::XML::Attr:0x3fd431448c08 name="id" value="p1">, #<Nokogiri::XML::Attr:0x3fd431448bf4 name="class" value="paragraphs">, #<Nokogiri::XML::Attr:0x3fd431448be0 name="foo" value="bar">] children=[#<Nokogiri::XML::Text:0x3fd431448384 "some text">]>
doc.at('body > p')     # => #<Nokogiri::XML::Element:0x3fd431448c6c name="p" attributes=[#<Nokogiri::XML::Attr:0x3fd431448c08 name="id" value="p1">, #<Nokogiri::XML::Attr:0x3fd431448bf4 name="class" value="paragraphs">, #<Nokogiri::XML::Attr:0x3fd431448be0 name="foo" value="bar">] children=[#<Nokogiri::XML::Text:0x3fd431448384 "some text">]>
doc.at('p')            # => #<Nokogiri::XML::Element:0x3fd431448c6c name="p" attributes=[#<Nokogiri::XML::Attr:0x3fd431448c08 name="id" value="p1">, #<Nokogiri::XML::Attr:0x3fd431448bf4 name="class" value="paragraphs">, #<Nokogiri::XML::Attr:0x3fd431448be0 name="foo" value="bar">] children=[#<Nokogiri::XML::Text:0x3fd431448384 "some text">]>

Notice how the address of the node

Nokogiri::XML::Element:0x3fd431448c6c

remains the same in the above results.

Building on that, we can also find nodes in the HTML by their attributes:

doc.at('p[foo="baz"]').to_html # => "<p id=\"p2\" class=\"paragraphs\" foo=\"baz\">some text</p>"

doc.search('p[foo="baz"]').size # => 1
doc.search('p[foo="baz"]').first.to_html # => "<p id=\"p2\" class=\"paragraphs\" foo=\"baz\">some text</p>"

The takeaway is that we should inspect the HTML, find the specific nodes that lead to the information we want, then write the shortest selector that gets there. Long selectors are more likely to break if the HTML changes.

Finally, beware of using browsers to inspect the HTML, because they modify tables. I created a file containing:

<html>
  <body>
    <table>
      <tr>
        <td>foo</td>
      </tr>
    </table>
  </body>
</html>

Opening it in Firefox, Opera or Safari and inspecting the page resulted in HTML that had been modified:

<html>
  <head></head>
  <body>
    <table>
      <tbody>
        <tr>
          <td>foo</td>
        </tr>
      </tbody>
    </table>
  </body>
</html>

Don't trust the browser; instead, use wget, curl or Nokogiri's own command-line tool to look at the HTML as it was actually served:

$ nokogiri http://example.com
Your document is stored in @doc...
irb(main):001:0> @doc
=> #<Nokogiri::HTML::Document:0x3fd748d60740 name="document" children=[#<Nokogiri::XML::DTD:0x3fd748d41a5c name="html">, #<Nokogiri::XML::Element:0x3fd748d41750 name="html" children=[#<Nokogiri::XML::Text:0x3fd748d41534 "\n">, #<Nokogiri::XML::Element:0x3fd748d41430 name="head" children=[#<Nokogiri::XML::Text:0x3fd748d41200 "\n    ">, #<Nokogiri::XML::Element:0x3fd748d41138 name="title" children=[#<Nokogiri::XML::Text:0x3fd748d40f44 "Example Domain">]>, #<Nokogiri::XML::Text:0x3fd748d40d78 "\n\n    ">, #<Nokogiri::XML::Element:0x3fd748d40cb0 name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fd748d40c4c name="charset" value="utf-8">]>, #<Nokogiri::XML::Text:0x3fd748d40544 "\n    ">, #<Nokogiri::XML::Element:0x3fd748d40454 name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fd748d403dc name="http-equiv" value="Content-type">, #<Nokogiri::XML::Attr:0x3fd748d403c8 name="content" value="text/html; charset=utf-8">]>, #<Nokogiri::XML::Text:0x3fd748d3d934 "\n    ">, #<Nokogiri::XML::Element:0x3fd748d3d858 name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fd748d3d7a4 name="name" value="viewport">, #<Nokogiri::XML::Attr:0x3fd748d3d77c name="content" value="width=device-width, initial-scale=1">]>, #<Nokogiri::XML::Text:0x3fd748d3ce1c "\n    ">, #<Nokogiri::XML::Element:0x3fd748d3cd68 name="style" attributes=[#<Nokogiri::XML::Attr:0x3fd748d3cd04 name="type" value="text/css">] children=[#<Nokogiri::XML::CDATA:0x3fd748d3c4bc "\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 50px;\n        background-color: #fff;\n        border-radius: 1em;\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        body {\n            background-color: #fff;\n        }\n        div {\n            width: 
auto;\n            margin: 0 auto;\n            border-radius: 0;\n            padding: 1em;\n        }\n    }\n    ">]>, #<Nokogiri::XML::Text:0x3fd748d3c0e8 "    \n">]>, #<Nokogiri::XML::Text:0x3fd748d39e38 "\n\n">, #<Nokogiri::XML::Element:0x3fd748d39d48 name="body" children=[#<Nokogiri::XML::Text:0x3fd748d39adc "\n">, #<Nokogiri::XML::Element:0x3fd748d39a00 name="div" children=[#<Nokogiri::XML::Text:0x3fd748d397bc "\n    ">, #<Nokogiri::XML::Element:0x3fd748d39460 name="h1" children=[#<Nokogiri::XML::Text:0x3fd748d39118 "Example Domain">]>, #<Nokogiri::XML::Text:0x3fd748d38f60 "\n    ">, #<Nokogiri::XML::Element:0x3fd748d38e84 name="p" children=[#<Nokogiri::XML::Text:0x3fd748d38c7c "This domain is established to be used for illustrative examples in documents. You may use this\n    domain in examples without prior coordination or asking for permission.">]>, #<Nokogiri::XML::Text:0x3fd748d38ab0 "\n    ">, #<Nokogiri::XML::Element:0x3fd748d389fc name="p" children=[#<Nokogiri::XML::Element:0x3fd748d38808 name="a" attributes=[#<Nokogiri::XML::Attr:0x3fd748d387a4 name="href" value="http://www.iana.org/domains/example">] children=[#<Nokogiri::XML::Text:0x3fd748d38344 "More information...">]>]>, #<Nokogiri::XML::Text:0x3fd748d38088 "\n">]>, #<Nokogiri::XML::Text:0x3fd748d35eb4 "\n">]>, #<Nokogiri::XML::Text:0x3fd748d35cfc "\n">]>]>
irb(main):002:0> @doc.at('a').to_html
=> "<a href=\"http://www.iana.org/domains/example\">More information...</a>"
irb(main):003:0> @doc.at('a')['href']
=> "http://www.iana.org/domains/example"

Upvotes: 0

Sukanta

Reputation: 585

If you want to get the price of:

"GeForce GTX 980 4GB GDDR5 256bits - Game Grátis - EVGA 04G-P4-2983-KR"

you can use Nokogiri with a CSS selector:

doc = Nokogiri::HTML(open("http://www.oficinadosbits.com.br/produto18064/EVGA_GeForce_GTX_980_04G-P4-2983-KR.html"))
price = doc.css("html > body > div")[0].css("div > div > table[1] > tr")[0].css("td[3] > table")[1].css("tr > td")[1].css("span")[0].text

Upvotes: -1

pguardiario

Reputation: 55002

There are plenty of helpful attributes on that page to target with CSS selectors:

page.at('[itemprop=price]').text
#=> "R$ 3.459,90"

Upvotes: 2

Amadan

Reputation: 198436

There is no tbody in /html/body/div[1]/div/div/table[1]. But you could have checked that yourself.

page.xpath("/html/body/div[1]/div/div/table[1]")
# => lots of output
page.xpath("/html/body/div[1]/div/div/table[1]/tbody")
# => whoopsie.

The issue is that Firebug's "Copy XPath" gives you the XPath for the DOM as it is in the browser at the moment you request it. That can differ from the DOM of the source document for various reasons, e.g. the DOM was changed by JavaScript, or the browser inserted certain nodes automatically.

Upvotes: 1
