lxml web-scraping, specific word extraction

Question

I'm trying to automate a script that scrapes counters from a website on my LAN, and I'm pulling my hair out.

The page looks like this:

title
   

table one
 Task     average 

 number     number 

    1-1      C
 6490       1 
    
    2-4      C
 442        2 
    
    5-10     C
 44         6 
    
    11-20    C
 3          15 
    
    21-30    C
 2          25 
    
    31-50    C
 1          40 
    
    sum
 6982       1

Every page has the same labels repeating (1-1, 2-4, 5-10, etc.), and I want to extract the numbers "below" them (6490, 442, and so on) in a specific order, so the result should look like this:

task - counter
1-1 = 6490
2-4 = 442

To do this I use:

import requests
from lxml import html

pageContent = requests.get('http://x.html')
tree = html.fromstring(pageContent.content)
# Grab the text nodes of every <p> element
scraped = tree.xpath('//p/text()')
print(scraped)

which prints something like this:

\xa0\xa0\xa0\xa0\xa0task ', u'1-1\xa0\xa0\xa0\xa0\xa0\xa0counter', u' 6490

I'm stuck. I tried other methods, but none of them worked.

SIM · Accepted Answer

Try this; it produces exactly the output you described above. Here, content is a string holding the HTML you pasted above.

from lxml.html import fromstring

root = fromstring(content)
# cssselect() needs the separate cssselect package; [3:] skips the first three rows
for items in root.cssselect("tr")[3:]:
    # Collapse runs of whitespace in each cell, then keep only its first token
    data = [' '.join(item.text_content().split()).split(" ")[0] for item in items.cssselect("td")]
    print(' = '.join(data))

Output:

1-1 = 6490
2-4 = 442
5-10 = 44
11-20 = 3
21-30 = 2
31-50 = 1
sum = 6982
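
If you would rather not install cssselect, the same idea can be sketched with plain XPath. Note that the SAMPLE markup below is made up for illustration (the real page's HTML isn't shown in the question), and extract_counters and skip_rows are hypothetical names, so adjust the number of skipped header rows to match your page.

```python
from lxml import html

# Hypothetical markup: the question does not show the page's real HTML, so this
# layout (two <td> cells per row, padded with &nbsp;) is only an assumption.
SAMPLE = """
<table>
  <tr><td>Task</td><td>average</td></tr>
  <tr><td>number</td><td>number</td></tr>
  <tr><td>&nbsp;&nbsp;1-1&nbsp;&nbsp;C</td><td>6490&nbsp;&nbsp;1</td></tr>
  <tr><td>&nbsp;&nbsp;2-4&nbsp;&nbsp;C</td><td>442&nbsp;&nbsp;2</td></tr>
  <tr><td>sum</td><td>6982&nbsp;&nbsp;1</td></tr>
</table>
"""


def extract_counters(markup, skip_rows=2):
    """Return (task, counter) pairs, one per table row after the header rows."""
    tree = html.fromstring(markup)
    pairs = []
    for row in tree.xpath(".//tr")[skip_rows:]:
        # split() with no argument also splits on \xa0 (the non-breaking space).
        cells = [td.text_content().split() for td in row.xpath("./td")]
        # Skip rows that do not have two non-empty cells.
        if len(cells) >= 2 and cells[0] and cells[1]:
            pairs.append((cells[0][0], cells[1][0]))
    return pairs


for task, counter in extract_counters(SAMPLE):
    print("{} = {}".format(task, counter))
```

Returning pairs instead of printing inside the loop makes it easier to reuse the result, for example to write the counters to a file.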
