Reputation: 33
I recently saw another user's question about extracting information from a web table, Extracting information from a webpage with Python. The answer from ekhumoro works great on the page from that question. See below.
from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/standings/division-i-men/2011-2012/'
tree = etree.HTML(urlopen(url).read())
for section in tree.xpath('//section[starts-with(@id, "section_")]'):
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td//text()')
        print ' ', cols[0].ljust(25), ' '.join(cols[1:])
    print
My problem is using this code as a guide to parse this page: http://www.uscho.com/rankings/d-i-mens-poll/. With the following changes, I can only get the h1 and h3 headings to print.
Input
url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())
for section in tree.xpath('//section[starts-with(@id, "rankings")]'):
    print section.xpath('h1[1]/text()')[0]
    print section.xpath('h3[1]/text()')[0]
    for row in section.xpath('table/tbody/tr'):
        cols = row.xpath('td/b/text()')
        print ' ', cols[0].ljust(25), ' '.join(cols[1:])
    print
Output
USCHO.com Division I Men's Poll
December 12, 2011
The structure of the table seems to be the same, so I'm at a loss as to why I can't use similar code. I'm just a mechanical engineer in way over my head. Any help is appreciated.
Upvotes: 3
Views: 6777
Reputation: 2514
Though this answer is old, it still comes up on the web, so I'd like to add another (more straightforward and up-to-date) option. It does add more dependencies, though: pandas, and tabulate (which the to_markdown method depends on).
Unfortunately, I think the webpage behind the URL used in this question has changed quite a lot since then (the table is now generated by JavaScript and is no longer in the source code), so for practical purposes I'll use this URL instead.
from lxml import etree, html
import pandas as pd
import requests

url = 'https://www.w3schools.com/html/html_tables.asp'
r = requests.get(url)

# If you want to get a specific table, proceed as follows:
tree_html = html.fromstring(r.content)
first_table = tree_html.xpath(".//table")[0]
df = pd.read_html(etree.tostring(first_table))[0]
print(df.to_markdown())
Output:
| | Tag | Description |
|---:|:-----------|:------------------------------------------------------------------------|
| 0 | <table> | Defines a table |
| 1 | <th> | Defines a header cell in a table |
| 2 | <tr> | Defines a row in a table |
| 3 | <td> | Defines a cell in a table |
| 4 | <caption> | Defines a table caption |
| 5 | <colgroup> | Specifies a group of one or more columns in a table for formatting |
| 6 | <col> | Specifies column properties for each column within a <colgroup> element |
| 7 | <thead> | Groups the header content in a table |
| 8 | <tbody> | Groups the body content in a table |
| 9 | <tfoot> | Groups the footer content in a table |
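As an aside, pd.read_html can also select a specific table directly, via its attrs parameter (a dict of HTML attributes the table must carry) or match (a regex matched against the table's text), skipping the lxml round-trip entirely. A sketch on a literal snippet (the table names and data here are invented for illustration; recent pandas versions want a file-like object, hence io.StringIO):

```python
import io
import pandas as pd

html_doc = """
<table id="prices"><tr><th>Item</th><th>Price</th></tr>
<tr><td>Widget</td><td>3</td></tr></table>
<table id="stock"><tr><th>Item</th><th>Qty</th></tr>
<tr><td>Widget</td><td>7</td></tr></table>
"""
# attrs filters tables on their HTML attributes, so only #stock is parsed.
df = pd.read_html(io.StringIO(html_doc), attrs={'id': 'stock'})[0]
print(df)
```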
But you can also get all the tables in one shot, this way:
list_tables = pd.read_html(r.content)
for table in list_tables:
    print(table.to_markdown() + '\n')
Output:
| | Company | Contact | Country |
|---:|:-----------------------------|:-----------------|:----------|
| 0 | Alfreds Futterkiste | Maria Anders | Germany |
| 1 | Centro comercial Moctezuma | Francisco Chang | Mexico |
| 2 | Ernst Handel | Roland Mendel | Austria |
| 3 | Island Trading | Helen Bennett | UK |
| 4 | Laughing Bacchus Winecellars | Yoshi Tannamuri | Canada |
| 5 | Magazzini Alimentari Riuniti | Giovanni Rovelli | Italy |
| | Tag | Description |
|---:|:-----------|:------------------------------------------------------------------------|
| 0 | <table> | Defines a table |
| 1 | <th> | Defines a header cell in a table |
| 2 | <tr> | Defines a row in a table |
| 3 | <td> | Defines a cell in a table |
| 4 | <caption> | Defines a table caption |
| 5 | <colgroup> | Specifies a group of one or more columns in a table for formatting |
| 6 | <col> | Specifies column properties for each column within a <colgroup> element |
| 7 | <thead> | Groups the header content in a table |
| 8 | <tbody> | Groups the body content in a table |
| 9 | <tfoot> | Groups the footer content in a table |
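The same list-of-DataFrames behavior can be checked offline on a literal snippet (the data below is invented; note again the io.StringIO wrapper, which recent pandas versions require in place of a bare HTML string):

```python
import io
import pandas as pd

html_doc = """
<table>
  <tr><th>Team</th><th>Record</th><th>Pts</th></tr>
  <tr><td>Minnesota-Duluth</td><td>12-3-3</td><td>999</td></tr>
  <tr><td>Minnesota</td><td>14-5-1</td><td>901</td></tr>
</table>
"""
tables = pd.read_html(io.StringIO(html_doc))  # one DataFrame per <table>
df = tables[0]                                # the <th> row becomes the header
print(df)
```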
Upvotes: 0
Reputation: 120598
The structure of the table is slightly different, and there are columns with blank entries.
Possible lxml solution:
from urllib2 import urlopen
from lxml import etree

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
tree = etree.HTML(urlopen(url).read())
for section in tree.xpath('//section[@id="rankings"]'):
    print section.xpath('h1[1]/text()')[0],
    print section.xpath('h3[1]/text()')[0]
    print
    for row in section.xpath('table/tr[@class="even" or @class="odd"]'):
        print '%-3s %-20s %10s %10s %10s %10s' % tuple(
            ''.join(col.xpath('.//text()')) for col in row.xpath('td'))
    print
Output:
USCHO.com Division I Men's Poll December 12, 2011
1 Minnesota-Duluth (49) 12-3-3 999 1
2 Minnesota 14-5-1 901 2
3 Boston College 12-6-0 875 3
4 Ohio State ( 1) 13-4-1 848 4
5 Merrimack 10-2-2 844 5
6 Notre Dame 11-6-3 667 7
7 Colorado College 9-5-0 650 6
8 Western Michigan 9-4-5 647 8
9 Boston University 10-5-1 581 11
10 Ferris State 11-6-1 521 9
11 Union 8-3-5 510 10
12 Colgate 11-4-2 495 12
13 Cornell 7-3-1 347 16
14 Denver 7-6-3 329 13
15 Michigan State 10-6-2 306 14
16 Lake Superior 11-7-2 258 15
17 Massachusetts-Lowell 10-5-0 251 18
18 North Dakota 9-8-1 88 19
19 Yale 6-5-1 69 17
20 Michigan 9-8-3 62 NR
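Two details in this answer do the heavy lifting. First, the row XPath drops tbody: lxml parses the raw source, and if the served markup has no &lt;tbody&gt;, an expression like table/tbody/tr matches nothing (browsers insert &lt;tbody&gt; when rendering, which is why the inspector can mislead you). Second, ''.join(col.xpath('.//text()')) collects every text node under a cell, so empty cells become '' instead of vanishing and text wrapped in &lt;b&gt; is still found. A minimal Python 3 sketch on an invented snippet:

```python
from lxml import etree

# Invented snippet for illustration: no <tbody>, a <b>-wrapped cell,
# and an empty cell -- the things that tripped up the original XPath.
snippet = b"""
<section id="rankings">
  <table>
    <tr class="odd"><td><b>1</b></td><td>Minnesota-Duluth (49)</td><td></td></tr>
    <tr class="even"><td><b>2</b></td><td>Minnesota</td><td>901</td></tr>
  </table>
</section>
"""
section = etree.HTML(snippet).xpath('//section[@id="rankings"]')[0]

print(len(section.xpath('table/tbody/tr')))  # 0: the source has no <tbody>
rows = section.xpath('table/tr[@class="even" or @class="odd"]')
print(len(rows))                             # 2: rows are direct children

# './/text()' keeps empty cells as '' and reaches the text inside <b>:
print([''.join(td.xpath('.//text()')) for td in rows[0].xpath('td')])
```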
Upvotes: 2
Reputation: 40374
lxml is great, but if you're not familiar with XPath, I recommend BeautifulSoup:
from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup

url = 'http://www.uscho.com/rankings/d-i-mens-poll/'
soup = BeautifulSoup(urlopen(url).read())
section = soup.find('section', id='rankings')

h1 = section.find('h1')
print h1.text
h3 = section.find('h3')
print h3.text
print

rows = section.find('table').findAll('tr')[1:-1]
for row in rows:
    columns = [data.text for data in row.findAll('td')[1:]]
    print '{0:20} {1:4} {2:>6} {3:>4}'.format(*columns)
The output for this script is:
USCHO.com Division I Men's Poll
December 12, 2011
Minnesota-Duluth (49) 12-3-3 999
Minnesota 14-5-1 901
Boston College 12-6-0 875
Ohio State ( 1) 13-4-1 848
Merrimack 10-2-2 844
Notre Dame 11-6-3 667
Colorado College 9-5-0 650
Western Michigan 9-4-5 647
Boston University 10-5-1 581
Ferris State 11-6-1 521
Union 8-3-5 510
Colgate 11-4-2 495
Cornell 7-3-1 347
Denver 7-6-3 329
Michigan State 10-6-2 306
Lake Superior 11-7-2 258
Massachusetts-Lowell 10-5-0 251
North Dakota 9-8-1 88
Yale 6-5-1 69
Michigan 9-8-3 62
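The import above is BeautifulSoup 3, which is Python 2 only; the current package is bs4, with find_all replacing findAll. The same idea in modern Python, sketched on an invented snippet standing in for the poll page's structure:

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only.
html_doc = """
<section id="rankings">
  <h1>USCHO.com Division I Men's Poll</h1>
  <table>
    <tr><th>Rk</th><th>Team</th><th>Record</th></tr>
    <tr><td>1</td><td>Minnesota-Duluth</td><td>12-3-3</td></tr>
    <tr><td>2</td><td>Minnesota</td><td>14-5-1</td></tr>
  </table>
</section>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
section = soup.find('section', id='rankings')
print(section.find('h1').text)

rows = section.find('table').find_all('tr')[1:]  # skip the header row
for row in rows:
    print([td.text for td in row.find_all('td')])
```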
Upvotes: 4