Reputation: 58
I'm trying to get values from an HTML table using Python. The HTML looks like this:
<table border=1 width=900>
<tr><td width=50%>
<table>
<tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
<tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
<tr><td align=right>Name:</td><td><input type=text value="John Dot" size=40></td></tr>
<tr><td align=right>Address:</td><td><input type=text value="Posie Row, Moscow Road" size=40></td></tr>
<tr><td align=right>City:</td><td><input type=text value="Co. Dubllin" size=40></td></tr>
<tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Postal Code:</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Country:</td><td><input type=text value="IRELAND" size=40></td></tr>
<tr><td align=right>Date:</td><td><input type=text value="24.4.18" size=12></td></tr>
<tr><td align=right>Sub Total:</td><td><input type=text value="93,24" size=40></td></tr>
<tr><td align=right>Combined Weight:</td><td><input type=text value="1,24" size=40></td></tr>
</table>
My code so far is:
from __future__ import print_function
import requests
import re
from bs4 import BeautifulSoup as bs
request = requests.get('url')
content = request.content
soup = bs(content, 'html.parser')
table = soup.findChildren('table')[1]
rows = table.findChildren('tr')
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        cell_content = cell.getText()
        print(cell_content)
Output is:
Invoice #
Company
Name:
Address:
City:
Province
Postal Code:
Country:
Date:
Sub Total:
Combined Weight:
I would like final output like the following:
Invoice:1624140
Company:NZone
Name:John Dot
Address:Posie Row, Moscow Road
City:Co. Dublin
Province:
Postal Code:
Country:IRELAND
Date:24.4.18
Sub Total:93,24
Combined Weight:1,24
Upvotes: 1
Views: 8733
Reputation: 2076
Replace your bottom loop with this:
for row in rows:
    row_title, row_val = row.findChildren('td')
    print(row_title.getText(), row_val.input['value'])
This code unpacks the two cells in each row, so it assumes every row has exactly two td cells. It then takes the text of the left td as the row title and reads the value attribute of the input inside the right td.
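If you want the exact "Label:value" lines from the question, something along these lines might work. This is only a sketch: format_rows is a made-up helper name, the HTML is a shortened copy of the table from the question, and stripping the trailing ":" and "#" from the labels is my guess at what the desired output implies.

```python
from bs4 import BeautifulSoup

# Shortened copy of the inner table from the question.
html = """
<table>
<tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
<tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
<tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
</table>
"""

def format_rows(markup):
    """Return one 'Label:value' string per row that holds a label and an input."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 2:
            continue  # skip rows that do not have the label/value layout
        field = cells[1].find("input")
        if field is None:
            continue  # skip rows whose right cell has no input
        # Assumption: drop a trailing ':' or '#' so the label matches the desired output.
        label = cells[0].get_text(strip=True).rstrip(":#").strip()
        lines.append("{}:{}".format(label, field.get("value", "")))
    return lines

for line in format_rows(html):
    print(line)
```

Skipping rows instead of unpacking them directly avoids a ValueError when a row has more or fewer than two cells.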
Upvotes: 0
Reputation: 174662
How about a dictionary comprehension?
d = {k.findChild('td').getText().strip():k.findChild('input')['value'] for k in rows}
The result is a dictionary like this:
{'Address:': 'Posie Row, Moscow Road',
 'City:': 'Co. Dubllin',
 'Combined Weight:': '1,24',
 'Company': 'NZone',
 'Country:': 'IRELAND',
 'Date:': '24.4.18',
 'Invoice #': '1624140',
 'Name:': 'John Dot',
 'Postal Code:': '',
 'Province': ''}
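One caveat with the comprehension: a row that has no td or no input makes findChild return None, and the lookup then fails. A guarded variant could look roughly like this. The "Totals" row is a hypothetical row I added to show the problem; it is not in the original table.

```python
from bs4 import BeautifulSoup

# Two rows from the question plus one hypothetical row without an input.
html = """
<table>
<tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
<tr><td colspan=2>Totals</td></tr>
<tr><td align=right>Date:</td><td><input type=text value="24.4.18" size=12></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr")

# Keep only rows that contain both a label cell and an input.
fields = {
    row.find("td").get_text(strip=True): row.find("input").get("value", "")
    for row in rows
    if row.find("td") is not None and row.find("input") is not None
}

print(fields)
```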
Upvotes: 1
Reputation: 128
After assigning rows, maybe you intended to write the following? Your current code has an indentation error. Please check whether this fixes your issue.
rows = table.findChildren('tr')
for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
        cell_content = cell.getText()
        print(cell_content)
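Note that even with the loop indented like this, the values will probably still be missing from the output, because they are stored in the value attribute of each input rather than as text inside the cell. A small check, using one cell copied from the question's table:

```python
from bs4 import BeautifulSoup

# One value cell copied from the question's table.
cell = BeautifulSoup(
    '<td><input type=text value="NZone" size=40></td>', "html.parser"
).td

text = cell.get_text()       # the cell contains no text nodes, so this is ''
value = cell.input["value"]  # the attribute holds the data: 'NZone'

print(repr(text), repr(value))
```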
Upvotes: 0
Reputation: 195543
data = """
<table border=1 width=900>
<tr><td width=50%>
<table>
<tr><td align=right><b>Invoice #</td><td><input type=text value="1624140" size=12></td></tr>
<tr><td align=right>Company</td><td><input type=text value="NZone" size=40></td></tr>
<tr><td align=right>Name:</td><td><input type=text value="John Dot" size=40></td></tr>
<tr><td align=right>Address:</td><td><input type=text value="Posie Row, Moscow Road" size=40></td></tr>
<tr><td align=right>City:</td><td><input type=text value="Co. Dubllin" size=40></td></tr>
<tr><td align=right>Province</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Postal Code:</td><td><input type=text value="" size=40></td></tr>
<tr><td align=right>Country:</td><td><input type=text value="IRELAND" size=40></td></tr>
<tr><td align=right>Date:</td><td><input type=text value="24.4.18" size=12></td></tr>
<tr><td align=right>Sub Total:</td><td><input type=text value="93,24" size=40></td></tr>
<tr><td align=right>Combined Weight:</td><td><input type=text value="1,24" size=40></td></tr>
</table>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for td, inp in zip(soup.find_all('td', align="right"), soup.find_all('input')):
    print(td.text, inp['value'])
Output is:
Invoice # 1624140
Company NZone
Name: John Dot
Address: Posie Row, Moscow Road
City: Co. Dubllin
Province
Postal Code:
Country: IRELAND
Date: 24.4.18
Sub Total: 93,24
Combined Weight: 1,24
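Keep in mind that zip pairs labels and inputs purely by position, so a single label without an input would shift every pair after it. If that is a concern, one option is to look up the input inside each label's own row. This is a sketch only; the "Note" row is a hypothetical addition to demonstrate the case, and I use html.parser here just to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Two rows from the question plus one hypothetical label row without an input.
html = """
<table>
<tr><td align=right>Name:</td><td><input type=text value="John Dot" size=40></td></tr>
<tr><td align=right>Note</td><td>n/a</td></tr>
<tr><td align=right>Country:</td><td><input type=text value="IRELAND" size=40></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for td in soup.find_all("td", align="right"):
    row = td.find_parent("tr")  # closest enclosing row of this label
    inp = row.find("input") if row is not None else None
    if inp is None:
        continue  # label without an input in the same row
    pairs.append((td.get_text(strip=True), inp.get("value", "")))

for label, value in pairs:
    print(label, value)
```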
Upvotes: 2