Reputation: 810
I'm trying to write code that converts an .html file to a .csv file.
I wrote code that works if the HTML file contains only one table:
from bs4 import BeautifulSoup
import csv
html = open("table.html").read()
soup = BeautifulSoup(html)
table = soup.find("table")
output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)
print(output_rows)
with open('output.csv', 'a') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)
To check that it works, I print the output rows. The code prints them correctly:
[['Data 1', 'Data 2', 'Data 3'], ['Hello', 'World', 'Wicaledon']]
The table.html file looks like this:
<table>
<tr>
<td>Data 1</td>
<td>Data 2</td>
<td>Data 3</td>
</tr>
<tr>
<td>Hello</td>
<td>World</td>
<td>Wicaledon</td>
</tr>
</table>
The problem appears when I use a table.html file that contains two tables, like this:
<html>
<head>
<title>Test Table</title>
</head>
<body>
<h2>First Table</h2>
<table>
<tr>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>C</td>
<td>D</td>
</tr>
<tr>
<td>E</td>
<td>F</td>
</tr>
<tr>
<td>G</td>
<td>H</td>
</tr>
</table>
<h2>Second Table</h2>
<table>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
</tr>
<tr>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
</tr>
<tr>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
</tr>
</table>
</table>
</body>
</html>
It prints the output rows like this:
[['A', 'B'], ['C', 'D'], ['E', 'F'], ['G', 'H']]
And the CSV file contains only these rows.
The expected output is:
[['A', 'B'], ['C', 'D'], ['E', 'F'], ['G', 'H']]
[['1', '2', '3', '4', '5', '6'],
['2', '3', '4', '5', '6', '7'],
['3', '4', '5', '6', '7', '8'],
['4', '5', '6', '7', '8', '9'],
['5', '6', '7', '8', '9', '10'],
['6', '7', '8', '9', '10', '11']]
Both of these arrays should be written to the CSV file.
How can I fix my code using the BeautifulSoup and csv modules?
Upvotes: 0
Views: 969
Reputation: 33384
This happens because you used find(), which returns only the first match. You need to use find_all() to get all the tables. Try this:
from bs4 import BeautifulSoup
data='''<html>
<head>
<title>Test Table</title>
</head>
<body>
<h2>First Table</h2>
<table>
<tr>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>C</td>
<td>D</td>
</tr>
<tr>
<td>E</td>
<td>F</td>
</tr>
<tr>
<td>G</td>
<td>H</td>
</tr>
</table>
<h2>Second Table</h2>
<table>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
</tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
</tr>
<tr>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
</tr>
<tr>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
</tr>
<tr>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
</tr>
<tr>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td>10</td>
<td>11</td>
</tr>
</table>
</table>
</body>
</html>'''
soup = BeautifulSoup(data, 'html.parser')
tables = soup.find_all("table")
output_rows = []
for table in tables:
    for table_row in table.findAll('tr'):
        columns = table_row.findAll('td')
        output_row = []
        for column in columns:
            output_row.append(column.text)
        output_rows.append(output_row)
print(output_rows)
[['A', 'B'], ['C', 'D'], ['E', 'F'], ['G', 'H'], ['1', '2', '3', '4', '5', '6'], ['2', '3', '4', '5', '6', '7'], ['3', '4', '5', '6', '7', '8'], ['4', '5', '6', '7', '8', '9'], ['5', '6', '7', '8', '9', '10'], ['6', '7', '8', '9', '10', '11']]
If you want the rows of each table in a separate list, build one list per table:
soup = BeautifulSoup(data, 'html.parser')
tables = soup.find_all("table")
output_final_rows = []
for table in tables:
    # Start a fresh row list for every table
    output_rows = []
    for table_row in table.findAll('tr'):
        columns = table_row.findAll('td')
        output_row = []
        for column in columns:
            output_row.append(column.text)
        output_rows.append(output_row)
    output_final_rows.append(output_rows)
print(output_final_rows)
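The question also asks for the result to end up in a CSV file, which the code above does not do yet. Here is a minimal sketch of that step using only the csv module. It assumes the nested shape of output_final_rows (a list of tables, each a list of rows). The helper name write_tables_to_csv and the blank line between tables are my own choices, not something the question requires, and the sample data is shortened.

```python
import csv


def write_tables_to_csv(tables, path):
    """Write every table's rows to one CSV file (hypothetical helper)."""
    # newline="" lets the csv module control line endings itself.
    # Mode "w" overwrites the file; the question used "a" (append).
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for index, rows in enumerate(tables):
            if index > 0:
                writer.writerow([])  # blank line between tables
            writer.writerows(rows)


# Same nested shape as output_final_rows, with shortened sample data.
sample_tables = [
    [["A", "B"], ["C", "D"]],
    [["1", "2", "3"], ["2", "3", "4"]],
]
write_tables_to_csv(sample_tables, "output.csv")
```

In the real script you would pass output_final_rows instead of sample_tables. If you would rather have one file per table, you could open a separate file inside the loop instead of writing the blank separator row.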
Upvotes: 1