Astro David

Reputation: 81

Extracting text/numbers from HTML list using Python requests and lxml

I am trying to extract the sales rank from items on Amazon using Python requests and lxml. For example, given:

<li id="SalesRank">
<b>Amazon Bestsellers Rank:</b> 

957,875 in Books (<a href="http://www.amazon.co.uk/gp/bestsellers/books/ref=pd_dp_ts_b_1">See Top 100 in Books</a>)

From this example, 957875 is the number I want to extract.

(Please note, the actual HTML has about 100 blank lines between 'Amazon Bestsellers Rank:' and '957,875'. I am unsure whether this is affecting my result.)

My current Python code is set up like so:

import re
import requests
from lxml import html

page = requests.get('http://www.amazon.co.uk/Lakeland-Expanding-Together-Compartments-Organiser/dp/B00A7Q77GM/ref=sr_1_1?s=kitchen&ie=UTF8&qid=1452504370&sr=1-1-spons&psc=1')
tree = html.fromstring(page.content)
salesrank = tree.xpath('//li[@id="SalesRank"]/text()')
print 'Sales Rank:', salesrank

The printed output is Sales Rank: []

I was expecting to receive the full list item text, including all the blank lines, which I would then parse. Am I correct in assuming that /text() is the wrong thing to use here and that I need something else? Any help is greatly appreciated.

Upvotes: 0

Views: 1074

Answers (1)

Jayant Jaiswal

Reputation: 181

You are getting an empty list because a single call to the URL is not returning the complete data of the web page. Instead, stream the response and read it in small chunks, then look for the required element in each non-empty chunk. The code is as follows:

import re
import requests as rq
from bs4 import BeautifulSoup as bs

r = rq.get('http://www.amazon.in/gp/product/0007950306/ref=s9_al_bw_g14_i1?pf_rd_m=A1VBAL9TL5WCBF&pf_rd_s=merchandised-search-3&pf_rd_r=1XBKB22RGT2HBKH4K2NP&pf_rd_t=101&pf_rd_p=798805127&pf_rd_i=4143742031', stream=True)

# Read the response in 1 KB chunks and stop at the first chunk
# that contains the <li id="SalesRank"> element.
for chunk in r.iter_content(chunk_size=1024):
    if chunk:
        # Name the parser explicitly to avoid BeautifulSoup's warning
        soup = bs(chunk, 'html.parser')
        elem = soup.find_all('li', attrs={'id': 'SalesRank'})
        if elem:
            # Match text such as "#1,234 in" and keep only the rank token
            s = re.findall(r'#[\d,]+\sin', str(elem[0]))
            print(s[0].split()[0])
            break
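
The regex step can also be tried on its own, without any network access. The following is only a minimal sketch: the function name parse_sales_rank and the sample string are made up for illustration, and it assumes the rank appears as a comma-grouped number followed by " in ", optionally prefixed with "#". It collapses the blank lines mentioned in the question before matching and returns the rank as an integer.

```python
import re


def parse_sales_rank(text):
    """Return the first 'N in ...' rank found in text as an int, or None.

    Hypothetical helper for illustration; it is not part of requests,
    lxml, or BeautifulSoup.
    """
    # Collapse every run of whitespace (including blank lines) to one space
    normalized = " ".join(text.split())
    # Optional '#', then digits with optional comma grouping, then ' in '
    match = re.search(r"#?(\d[\d,]*)\s+in\s", normalized)
    if match is None:
        return None
    # Drop the thousands separators before converting to int
    return int(match.group(1).replace(",", ""))


# Sample text modelled on the list item in the question (an assumption)
sample = "Amazon Bestsellers Rank:\n\n\n\n957,875 in Books (See Top 100 in Books)"
print(parse_sales_rank(sample))
```

Returning None when nothing matches makes it easy to tell "element found but no rank in it" apart from a real value, instead of hitting an IndexError on an empty findall result.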

Upvotes: 1
