Reputation: 75
I am trying to parse data from a website. For example, the relevant portion of the page source for the site I am trying to extract data from looks like this:
<table summary="Customer Pending and Vendor Pending Table">
<tr>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
<img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
Avg Last Updated </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
Avg Days Open </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
# of Cases </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
</tr>
<tr >
<td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
<td> 8.0</td>
<td> 69.0</td>
<td>1</td>
<td> 3.1</td>
</tr>
I need to extract the values 8.0, 69.0 and 3.1 from the above table. My Python code looks like this:
from lxml import html
import requests
page = requests.get('http://rat-sucker.abc.com/team.php?wrkgrp=somedata')
tree = html.fromstring(page.text)
Stats = tree.xpath('//*[@id="leftrat"]/table[1]/tbody/tr[2]/td[2]')
print 'Stats: ', Stats
I have checked my XPath using several methods and Xcode simulator, and it is correct (it may not work if you run it against the partial markup above), but when my Python script runs it does not produce any output.
[root@testbed testhost]# python scrapper.py Stats
[root@testbed testhost]#
Upvotes: 1
Views: 696
Reputation: 174874
You could use the BeautifulSoup parser instead.
>>> s = '''<table summary="Customer Pending and Vendor Pending Table">
<tr>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Level&Escalationorder=0#Escalation" class="headlink">
<img src="/images/rat/up_selected.png" width="11" height="9" border="0" alt="up">Risk </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgLastUpd&Escalationorder=1#Escalation" class="headlink">
Avg Last Updated </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=AvgDaysOpen&Escalationorder=1#Escalation" class="headlink">
Avg Days Open </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort=Srs&Escalationorder=1#Escalation" class="headlink">
# of Cases </a> </th>
<th> <a href="/team.php?wrkgrp=Somedata&Escalationsort_pct=1&Escalationorder=1#Escalation" class="headlink">% of Total Cases</a> </th>
</tr>
<tr >
<td><a href="/snapshot.php?statusrisk=2&wrkgrp=Somedata&function=statusrisk&statuses=CustomerPending"><img src="/images/rat/severity_2.gif" alt="Very High Risk" title="Very High Risk" border="0"></a></td>
<td> 8.0</td>
<td> 69.0</td>
<td>1</td>
<td> 3.1</td>
</tr>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(s, 'html.parser')
>>> [i.text.strip() for i in soup.find_all('td', text=True)]
['8.0', '69.0', '1', '3.1']
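If you would rather stay with lxml, one possible cause of the empty result (an assumption, since the full page is not shown) is the `tbody` step in the XPath. Browser developer tools usually show a `<tbody>` that the browser inserted itself, while the raw HTML fetched by `requests` has none, so the path matches nothing. A minimal sketch that drops `tbody` and reads the text of each cell is below. The `SAMPLE` markup, the wrapping `<div id="leftrat">` and the `extract_cells` helper are hypothetical stand-ins built from the question's snippet and XPath.

```python
from lxml import html

# Hypothetical, trimmed-down copy of the table from the question.
# The wrapping <div id="leftrat"> is assumed from the asker's XPath.
SAMPLE = '''<div id="leftrat">
<table summary="Customer Pending and Vendor Pending Table">
<tr>
<th>Risk</th><th>Avg Last Updated</th><th>Avg Days Open</th>
<th># of Cases</th><th>% of Total Cases</th>
</tr>
<tr>
<td><a href="#"><img src="/images/rat/severity_2.gif" alt="Very High Risk"></a></td>
<td> 8.0</td>
<td> 69.0</td>
<td>1</td>
<td> 3.1</td>
</tr>
</table>
</div>'''


def extract_cells(markup):
    """Return the stripped text of the data cells in the second table row."""
    tree = html.fromstring(markup)
    # No tbody step: the raw markup has <tr> directly under <table>.
    cells = tree.xpath('//*[@id="leftrat"]/table[1]//tr[2]/td')
    # Skip the first cell, which only holds the risk icon.
    return [cell.text_content().strip() for cell in cells[1:]]


values = extract_cells(SAMPLE)
print(values)
```

Note that `xpath()` returns a list of element objects, so printing the result directly shows the elements rather than their text; `text_content()` is what pulls the numbers out. For the live page, passing `page.text` to the same helper should work as long as the real markup matches these assumptions.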
Upvotes: 4