Reputation: 16227
Given:
import urllib2
from lxml import etree
url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
where the URL is a standard eBay search results page with some filtering applied.
I'm looking to extract the product prices, e.g. $40.00, $34.95 etc.
There are a few possible XPaths (as provided by Firebug, the XPath Checker Firefox add-on, and a manual inspection of the source):
/html/body/div[5]/div[2]/div[3]/div/div[1]/div/div[3]/div/div[1]/div/w-root/div/div/ul/li[1]/ul[1]/li[1]/span
id('item3d00cf865e')/x:ul[1]/x:li[1]/x:span
//span[@class ='bold bidsold']
Choosing the last one:
xpathselector="//span[@class ='bold bidsold']"
tree.xpath(xpathselector)
then returns a list of Element objects, as expected. When I get their .text attributes, I would have expected to get the prices. But what I get is:
In [17]: tree.xpath(xpathselector)
Out[17]:
['\n\t\t\t\t\t',
u' 1\xc2\xa0103.78',
'\n\t\t\t\t\t',
u' 1\xc2\xa0048.28',
'\n\t\t\t\t\t',
' 964.43',
'\n\t\t\t\t\t',
' 922.43',
'\n\t\t\t\t\t',
' 922.43',
'\n\t\t\t\t\t',
' 275.67',
'\n\t\t\t\t\t',
The values contained within each look like prices, but (i) the prices are substantially higher than those displayed on the web page, and (ii) I'm wondering what all the newlines and tabs are doing there. Is there something I'm doing fundamentally wrong in trying to extract the prices?
I usually use WebDriver for this sort of thing, and take advantage of finding elements by CSS selector, XPath and class. But in this case, I want no browser interaction, which is why I am going with urllib2 and lxml for the first time.
Upvotes: 1
Views: 2818
Reputation: 5429
I see 2 possible cases:
I would recommend checking the following:
Upvotes: 1
Reputation: 172
I wrote two examples in Python.
Example 1:
import urllib2
from lxml import etree

if __name__ == '__main__':
    url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
    xpathselector = "//span[@class ='bold bidsold']"
    for i in tree.xpath(xpathselector):
        # Keep only characters with code points below 64 (digits, punctuation,
        # whitespace), then trim; i.text is None for a span with no leading text.
        print "".join(filter(lambda x: ord(x) < 64, i.text or "")).strip()
Example 2:
import urllib2
from lxml import etree

if __name__ == '__main__':
    url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
    # Same as Example 1, but also matches spans with the class 'sboffer'.
    xpathselector = "//span[@class ='bold bidsold']|//span[@class='sboffer']"
    for i in tree.xpath(xpathselector):
        # Keep only characters with code points below 64 (digits, punctuation,
        # whitespace), then trim; i.text is None for a span with no leading text.
        print "".join(filter(lambda x: ord(x) < 64, i.text or "")).strip()
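If you need the values as numbers rather than printed strings, the cleanup step can be done without the network. Below is a minimal sketch that works on strings like the ones shown in the question. The helper name clean_price is hypothetical (it is not part of lxml), and I am assuming that the u'\xc2\xa0' sequence is a non-breaking space used as a thousands separator that was decoded with the wrong encoding, and that a comma may also appear as a thousands separator.

```python
import re

# Hypothetical helper for illustration; not part of lxml or urllib2.
def clean_price(text):
    """Return the number found in one scraped string, or None if there is none."""
    if text is None:
        return None
    # Assumption: the non-breaking space is a thousands separator. Remove it
    # both as a real U+00A0 and as the mis-decoded pair u'\xc2\xa0'.
    text = text.replace(u'\xc2\xa0', u'').replace(u'\xa0', u'')
    # First run of digits, with optional comma separators and a decimal part.
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None  # whitespace-only entries such as u'\n\t\t\t\t\t'
    return float(match.group(0).replace(',', ''))

samples = [u'\n\t\t\t\t\t', u' 1\xc2\xa0103.78', u' 964.43', u'$40.00']
prices = [p for p in (clean_price(s) for s in samples) if p is not None]
print(prices)
```

In the loops above you would call clean_price(i.text) and skip the None results. This only tidies the strings; it does not explain why the amounts differ from what the browser shows.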
Upvotes: 1