Reputation: 81
I am trying to extract the 'Seller rank' from items on amazon using Python requests and lxml. So:
<li id="SalesRank">
<b>Amazon Bestsellers Rank:</b>
957,875 in Books (<a href="http://www.amazon.co.uk/gp/bestsellers/books/ref=pd_dp_ts_b_1">See Top 100 in Books</a>)
from this example, 957875 is the number I want to extract.
(Please note, the actual HTML has about 100 blank lines between 'Amazon Bestsellers Rank:' and '957875'. Unsure if this is effecting my result.)
My current Python code is set up like so:
import re
import requests
from lxml import html
page = requests.get('http://www.amazon.co.uk/Lakeland-Expanding-Together-Compartments-Organiser/dp/B00A7Q77GM/ref=sr_1_1?s=kitchen&ie=UTF8&qid=1452504370&sr=1-1-spons&psc=1')
tree = html.fromstring(page.content)
salesrank = tree.xpath('//li[@id="SalesRank"]/text()')
print 'Sales Rank:', salesrank
and the printed output is
Sales Rank: []
I was expecting to receive the full list data including all the blank lines of which I would later parse. Am I correct in assuming that /text() is not the correct use in this instance and I need to put something else? Any help is greatly appreciated.
Upvotes: 0
Views: 1074
Reputation: 181
You are getting an empty list because in one call of the url you are not getting the complete data of the web page. For that you have to stream through the url and get all the data in small chunks. And then find out the required in the non-empty chunk. The code for the following is :-
import requests as rq
import re
from bs4 import BeautifulSoup as bs
r=rq.get('http://www.amazon.in/gp/product/0007950306/ref=s9_al_bw_g14_i1?pf_rd_m=A1VBAL9TL5WCBF&pf_rd_s=merchandised-search-3&pf_rd_r=1XBKB22RGT2HBKH4K2NP&pf_rd_t=101&pf_rd_p=798805127&pf_rd_i=4143742031',stream=True)
for chunk in r.iter_content(chunk_size=1024):
if chunk:
data = chunk
soup=bs(data)
elem=soup.find_all('li',attrs={'id':'SalesRank'})
if elem!=[]:
s=re.findall('#[\d+,*]*\sin',str(elem[0]))
print s[0].split()[0]
break
Upvotes: 1