Reputation: 2302
I am trying to scrape a table:
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<table class="table ajax">
<thead>
<tr>
<th scope="col">
<span>NO.</span>
</th>
<th scope="col" data-index="1">
<span>Year of initiation</span>
</th>
<th scope="col" data-index="2">
<span>Short case name</span>
</th>
<th scope="col" data-index="3" style="display: none;">
<span>Full case name</span>
</th>
<th scope="col" data-index="4">
<span>Applicable IIA</span>
</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">1</th>
<td data-index="1">
2019
</td>
<td data-index="2">
Alcosa v. Kuwait
</td>
<td data-index="3" style="display: none;">
Alcosa v. The State of Kuwait
</td>
<td data-index="4">
Kuwait - Spain BIT(2005) </td>
<td data-index="5"> UNCITRAL
</td>
</tr>
</tbody>
</table>
</body>
</html>
with the following code:
html = driver.page_source
bs=BeautifulSoup(html, "lxml")
table = bs.find('table', { 'class' : 'ajax' })
table_body=table.find('tbody')
rows = table_body.findAll('tr')
with open('son.csv', "wt+") as f:
    writer = csv.writer(f)
    for row in rows:
        cols = row.find_all('td')
        cols = [x.get_text(strip=True, separator='|') for x in cols]
        writer.writerow(cols)
I can get the table rows, but I can't get the table header.
This is the output I want to get:
NO.  Year of initiation  Short case name   Applicable IIA
1    2019                Alcosa v. Kuwait  Kuwait - Spain BIT(2005)  UNCITRAL
How can I do it? Thanks.
Upvotes: 1
Views: 881
Reputation: 195438
You can try this script, which reads the header cells from thead and saves the table to CSV:
import csv
from bs4 import BeautifulSoup
txt = '''<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<table class="table ajax">
<thead>
<tr>
<th scope="col">
<span>NO.</span>
</th>
<th scope="col" data-index="1">
<span>Year of initiation</span>
</th>
<th scope="col" data-index="2">
<span>Short case name</span>
</th>
<th scope="col" data-index="3" style="display: none;">
<span>Full case name</span>
</th>
<th scope="col" data-index="4">
<span>Applicable IIA</span>
</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">1</th>
<td data-index="1">
2019
</td>
<td data-index="2">
Alcosa v. Kuwait
</td>
<td data-index="3" style="display: none;">
Alcosa v. The State of Kuwait
</td>
<td data-index="4">
Kuwait - Spain BIT(2005)
</td>
<td data-index="5"> UNCITRAL
</td>
</tr>
</tbody>
</table>
</body>
</html>'''
soup = BeautifulSoup(txt, 'html.parser')
headers = [th.get_text(strip=True) for th in soup.select('table.ajax thead th')]
rows = []
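for row in soup.select('table.ajax tbody tr'):
    # Select th as well as td so the row number in <th scope="row"> is kept
    data = [d.get_text(strip=True) for d in row.select('th, td')]
    rows.append(data)
# newline='' stops the csv module from writing blank lines between rows on Windows
with open('son.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    for row in rows:
        writer.writerow(row)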
This writes son.csv, with the header row first and the data rows after it.
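Note that the script above also writes the hidden "Full case name" column, which the desired output in the question leaves out. If you want to skip it, one option is to filter out cells whose inline style contains display: none. The following is only a minimal sketch under that assumption: is_hidden, visible_cells and table_to_rows are hypothetical helper names, the HTML is a trimmed copy of the sample from the question, and it only checks the inline style attribute (columns hidden through a CSS class or a stylesheet would not be detected).

```python
import csv
import io

from bs4 import BeautifulSoup

# Trimmed-down copy of the table from the question.
SAMPLE_HTML = """
<table class="table ajax">
  <thead>
    <tr>
      <th scope="col"><span>NO.</span></th>
      <th scope="col" data-index="1"><span>Year of initiation</span></th>
      <th scope="col" data-index="2"><span>Short case name</span></th>
      <th scope="col" data-index="3" style="display: none;"><span>Full case name</span></th>
      <th scope="col" data-index="4"><span>Applicable IIA</span></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th scope="row">1</th>
      <td data-index="1"> 2019 </td>
      <td data-index="2"> Alcosa v. Kuwait </td>
      <td data-index="3" style="display: none;"> Alcosa v. The State of Kuwait </td>
      <td data-index="4"> Kuwait - Spain BIT(2005) </td>
      <td data-index="5"> UNCITRAL </td>
    </tr>
  </tbody>
</table>
"""


def is_hidden(cell):
    """Hypothetical helper: True if the cell's inline style sets display: none."""
    style = cell.get("style", "").replace(" ", "").lower()
    return "display:none" in style


def visible_cells(row):
    """Return the stripped text of every th/td in a row that is not hidden."""
    return [c.get_text(strip=True) for c in row.select("th, td") if not is_hidden(c)]


def table_to_rows(html):
    """Collect the header row and the body rows of table.ajax as lists of strings."""
    soup = BeautifulSoup(html, "html.parser")
    header = visible_cells(soup.select_one("table.ajax thead tr"))
    body = [visible_cells(tr) for tr in soup.select("table.ajax tbody tr")]
    return [header] + body


rows = table_to_rows(SAMPLE_HTML)

# Written to an in-memory buffer for illustration; for a real file, use
# open('son.csv', 'w', newline='') instead of io.StringIO().
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

With the sample table, the header row ends up with four columns and the data row with five, because the last cell (UNCITRAL) has no matching header in the sample HTML. In your Selenium code you would pass driver.page_source to table_to_rows instead of SAMPLE_HTML.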
Upvotes: 1