Pyderman

Reputation: 16227

Extracting text from a span with lxml?

Given:

import urllib2
from lxml import etree

url =  "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)

where the URL is a standard eBay search results page with some filtering applied.

I'm looking to extract the product prices, e.g. $40.00, $34.95, etc.

There are a few possible XPaths (as provided by Firebug, the XPath Checker Firefox add-on, and a manual inspection of the source):

/html/body/div[5]/div[2]/div[3]/div/div[1]/div/div[3]/div/div[1]/div/w-root/div/div/ul/li[1]/ul[1]/li[1]/span
id('item3d00cf865e')/x:ul[1]/x:li[1]/x:span
//span[@class ='bold bidsold']

Choosing the last of these:

xpathselector="//span[@class ='bold bidsold']"

tree.xpath(xpathselector) then returns a list of Element objects, as expected. I expected their .text attributes to contain the prices, but what I get is:

In [17]: tree.xpath(xpathselector)
Out[17]: 
['\n\t\t\t\t\t',
 u' 1\xc2\xa0103.78',
 '\n\t\t\t\t\t',
 u' 1\xc2\xa0048.28',
 '\n\t\t\t\t\t',
 ' 964.43',
 '\n\t\t\t\t\t',
 ' 922.43',
 '\n\t\t\t\t\t',
 ' 922.43',
 '\n\t\t\t\t\t',
 ' 275.67',
 '\n\t\t\t\t\t',

The values look like prices, but (i) they are substantially higher than those displayed on the web page, and (ii) I'm wondering what all the newlines and tabs are doing there. Is there something I'm doing fundamentally wrong in trying to extract the prices?

I usually use WebDriver for this sort of thing, and take advantage of finding elements by css selector, xpath and class. But in this case, I want no browser interaction, which is why I am going with urllib2 and lxml for the first time.


Upvotes: 1

Views: 2818

Answers (2)

Dmytro Pastovenskyi

Reputation: 5429

I see two possible explanations:

  1. It looks like eBay checks your locale and converts the price into the currency of your country. When you open the page in a browser, it may read some browser settings; when you run the code, it may read settings from somewhere else.
  2. The prices may be adjusted by eBay with JavaScript on the client side, in which case your parser never sees the adjusted values.

I would recommend checking the following:

  1. Check which currency you get when you run the code.
  2. Check the page source and confirm that the prices there are exactly the same as the ones you see in the browser.
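
One thing that can be checked offline is the `\xc2\xa0` pair in the question's output. Those two bytes are the UTF-8 encoding of a non-breaking space (U+00A0), so the strings look like UTF-8 text that was decoded as Latin-1. Some locales use a non-breaking space as a thousands separator, which would be consistent with the first explanation, though it does not prove it. Below is a minimal Python 3 sketch that repairs such strings and converts them to numbers. The helper names `repair_mojibake` and `to_number` are made up for illustration, and the sample strings are copied from the question's output.

```python
def repair_mojibake(text):
    """Undo UTF-8 bytes that were decoded as Latin-1, if that is what happened."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not Latin-1-decoded UTF-8; leave it unchanged.
        return text


def to_number(text):
    """Strip whitespace and non-breaking spaces, then parse as a float."""
    cleaned = repair_mojibake(text).replace("\u00a0", "").strip()
    return float(cleaned)


# Sample strings taken from the output shown in the question.
samples = [" 1\xc2\xa0103.78", " 964.43"]
values = [to_number(s) for s in samples]
print(values)
```

If the repaired values still differ from what the browser shows, that points toward a currency difference rather than a parsing problem.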

Upvotes: 1

Randomazer

Reputation: 172

I wrote two examples in Python.

Example 1:

import urllib2
from lxml import etree

if __name__ == '__main__':
    url =  "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
    xpathselector="//span[@class ='bold bidsold']"
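    for i in tree.xpath(xpathselector):
        # Keep only characters with code points below 64 (digits, punctuation, whitespace);
        # this drops letters and the stray \xc2\xa0 characters. i.text can be None.
        print "".join(filter(lambda x: ord(x)<64, i.text or "")).strip()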

Example 2:

import urllib2
from lxml import etree

if __name__ == '__main__':
    url =  "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
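    # Same as Example 1, but also match spans whose class is 'sboffer'.
    xpathselector="//span[@class ='bold bidsold']|//span[@class='sboffer']"
    for i in tree.xpath(xpathselector):
        # Keep only characters with code points below 64; i.text can be None.
        print "".join(filter(lambda x: ord(x)<64, i.text or "")).strip()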
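
To make the filter in both examples easier to follow, here is a rough Python 3 sketch of the same idea, run on strings copied from the question's output rather than on a live page. The function name `keep_low_ascii` is hypothetical. Note the trade-off: the `ord(ch) < 64` test keeps digits, `.`, `,`, `$` and whitespace, but it discards every letter and any non-ASCII currency symbol, so it tells you nothing about which currency the price is in.

```python
def keep_low_ascii(text):
    """Keep characters with code points below 64, then trim whitespace."""
    # text may be None when an element has no leading text node.
    return "".join(ch for ch in (text or "") if ord(ch) < 64).strip()


# Sample strings taken from the output shown in the question, plus None.
samples = ["\n\t\t\t\t\t", " 1\xc2\xa0103.78", " 964.43", None]
cleaned = [keep_low_ascii(s) for s in samples]
print(cleaned)
```

Whitespace-only entries come out as empty strings, so they can be skipped with a simple truthiness check before printing.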

Upvotes: 1
