Reputation: 517
I want to extract all the tables from a html file as given below using BeautifulSoup and write it to csv.
Html looks like below:
<h4>Site Name : Aria</h4>
<table style="width: 100%">
<tbody><tr>
<th style="width: 25%"><strong>Dn Name:</strong></th>
<td style="width: 25%"><strong>Aria</strong></td>
<th style="width: 25%"><strong>WL:</strong></th>
<td style="width: 25%"><strong> Meters (m)</strong></td>
</tr>
<tr>
<th><strong>River Name:</strong></th>
<td><strong>Ben</strong></td>
<th><strong>DL:</strong></th>
<td><strong> Meters (m)</strong></td>
</tr>
<tr>
<th><strong>Basin Name:</strong></th>
<td><strong>GAN<strong></strong></strong></td>
<th><strong>HFL:</strong></th>
<td><strong>49.4 Meters (m)<strong></strong></strong></td>
</tr>
<tr>
<th><strong>Div Name:</strong></th>
<td><a target="_blank" href="http://imd.gov.in/ onclick="window.open(this.href, this.target, 'width=1000, height=600, toolbar=no, resizable=no'); return false;">LGD-I</a></td>
<th><strong>HFL date:</strong></th>
<td>14-08-2017</td>
</tr>
</tbody></table>
<p> </p>
<table>
<tbody><tr>
<th colspan="3" style="text-align: center;"><strong>PRESENT WL</strong></th>
</tr>
<tr>
<td class="" style="width:33%; height:18px;">Date: 17-07-2018 12:00</td>
<td class="" style="width:33%;">Value: 45.43 Meters (m)</td>
<td class="" style="width:33%;">Trend: Steady</td>
</tr>
<tr>
<th colspan="3" style="text-align: center;"><strong>CUMULATIVE DAILY RF</strong></th>
</tr>
<tr>
<td style="width:33%; height:18px;">Date: 17-07-2018 08:30</td>
<td style="width:33%;">Value: 0.0 Milimiters (mm)</td>
<td style="width:33%;"></td>
</tr>
</tbody></table>
<p> </p>
<table style="width: 100%">
<tbody><tr>
<th colspan="4" style="text-align: center;"><strong>NO FORECAST</strong></th>
</tr>
</tbody></table>
</div>
I amble to extract the text from all the three table but I am not able to write it in the desired format
My code
now = datetime.datetime.now()
date = now.strftime("%d-%m-%Y")
os.chdir(r'D:\shared')
soup = BeautifulSoup(response.text,"html5lib")
tables = soup.find_all("tr")
test =[]
for table in tables:
test.append(table.get_text())
filename = 'Water'+'-'+str(date)+'.csv'
out = open(filename, mode='ab')
writer = csv.writer(out)
writer.writerow(data)
out.close()
In output csv first table is being written in 1st column 2nd table is being written in second table and third table is written in 3rd column.
I want data in the following format:
Site Name: Aria
Dn Name: Aria
WL: Meters (m)
River Name: Ben
DL: Meters (m)
Basin Name: GAN
HFL: 49.4 Meters (m)
Div Name: LGD-I)
HFL date: 14-08-2017
PRESENT WL
Date: 17-07-2018 12:00
Value: 45.43 Meters (m)
Trend: Steady
CUMULATIVE
DAILY RF
Date: 17-07-2018 08:30
Value: 0.0 Milimiters (mm)
NO FORECAST
Upvotes: 0
Views: 234
Reputation: 195468
My attempt to this problem:
data = """
<h4>Site Name : Aria</h4>
<table style="width: 100%">
<tbody><tr>
<th style="width: 25%"><strong>Dn Name:</strong></th>
<td style="width: 25%"><strong>Aria</strong></td>
<th style="width: 25%"><strong>WL:</strong></th>
<td style="width: 25%"><strong> Meters (m)</strong></td>
</tr>
<tr>
<th><strong>River Name:</strong></th>
<td><strong>Ben</strong></td>
<th><strong>DL:</strong></th>
<td><strong> Meters (m)</strong></td>
</tr>
<tr>
<th><strong>Basin Name:</strong></th>
<td><strong>GAN<strong></strong></strong></td>
<th><strong>HFL:</strong></th>
<td><strong>49.4 Meters (m)<strong></strong></strong></td>
</tr>
<tr>
<th><strong>Div Name:</strong></th>
<td><a target="_blank" href="http://imd.gov.in/ onclick="window.open(this.href, this.target, 'width=1000, height=600, toolbar=no, resizable=no'); return false;">LGD-I</a></td>
<th><strong>HFL date:</strong></th>
<td>14-08-2017</td>
</tr>
</tbody></table>
<p> </p>
<table>
<tbody><tr>
<th colspan="3" style="text-align: center;"><strong>PRESENT WL</strong></th>
</tr>
<tr>
<td class="" style="width:33%; height:18px;">Date: 17-07-2018 12:00</td>
<td class="" style="width:33%;">Value: 45.43 Meters (m)</td>
<td class="" style="width:33%;">Trend: Steady</td>
</tr>
<tr>
<th colspan="3" style="text-align: center;"><strong>CUMULATIVE DAILY RF</strong></th>
</tr>
<tr>
<td style="width:33%; height:18px;">Date: 17-07-2018 08:30</td>
<td style="width:33%;">Value: 0.0 Milimiters (mm)</td>
<td style="width:33%;"></td>
</tr>
</tbody></table>
<p> </p>
<table style="width: 100%">
<tbody><tr>
<th colspan="4" style="text-align: center;"><strong>NO FORECAST</strong></th>
</tr>
</tbody></table>
</div>"""
import os
import datetime
from bs4 import BeautifulSoup
from pprint import pprint
# For Python 2.7 the next line should be "from itertools import izip_longest"
from itertools import zip_longest
import csv
now = datetime.datetime.now()
date = now.strftime("%d-%m-%Y")
# os.chdir(r'D:\shared')
soup = BeautifulSoup(data, "lxml")
tables = []
for table in soup.find_all('table'):
current_table = []
tables.append(current_table)
for row in table.find_all("tr"):
for (th, td) in zip_longest(row.find_all('th'), row.find_all('td')):
s = ("%s %s" % (th.text.strip() if th else '', td.text.strip() if td else '')).strip()
if s:
current_table.append(s)
tables[0].insert(0, ': '.join(w.strip() for w in soup.find('h4').text.split(':')))
for table in tables:
for i in table:
print(i)
filename = 'CWC-Water'+'-'+str(date)+'.csv'
out = open(filename, mode='w')
writer = csv.writer(out)
for table in zip_longest(*tables):
writer.writerow(table)
out.close()
Prints:
Site Name: Aria
Dn Name: Aria
WL: Meters (m)
River Name: Ben
DL: Meters (m)
Basin Name: GAN
HFL: 49.4 Meters (m)
Div Name: LGD-I
HFL date: 14-08-2017
PRESENT WL
Date: 17-07-2018 12:00
Value: 45.43 Meters (m)
Trend: Steady
CUMULATIVE DAILY RF
Date: 17-07-2018 08:30
Value: 0.0 Milimiters (mm)
NO FORECAST
And outputs .csv file with following format (3-columns from tables, screenshot from LibreOffice):
Edit: - correct picture
Upvotes: 2