Wicaledon

Reputation: 810

Python : HTML to CSV with Multiple Tables using BeautifulSoup

I'm trying to write code that converts an .html file to a .csv file.

My code works if the HTML file contains only one table.

from bs4 import BeautifulSoup
import csv

html = open("table.html").read()
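# Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning
soup = BeautifulSoup(html, "html.parser")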
table = soup.find("table")

output_rows = []
for table_row in table.findAll('tr'):
    columns = table_row.findAll('td')
    output_row = []
    for column in columns:
        output_row.append(column.text)
    output_rows.append(output_row)
print(output_rows)

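# newline='' is what the csv module expects; it avoids blank lines between rows on Windows
with open('output.csv', 'a', newline='') as csvfile: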
    writer = csv.writer(csvfile)
    writer.writerows(output_rows)

To check that it works, I print the output rows. The code shows them correctly: [['Data 1', 'Data 2', 'Data 3'], ['Hello', 'World', 'Wicaledon']]

And the table.html file is like this:

<table>
  <tr>
    <td>Data 1</td>
    <td>Data 2</td>
    <td>Data 3</td>
  </tr>
  <tr>
    <td>Hello</td>
    <td>World</td>
    <td>Wicaledon</td>
  </tr>
</table>

The problem appears when I use a table.html file that contains two tables, like this:

<html>
  <head>
    <title>Test Table</title>
  </head>
  <body>
    <h2>First Table</h2>
    <table>
      <tr>
        <td>A</td>
        <td>B</td>
      </tr>
      <tr>
        <td>C</td>
        <td>D</td>
      </tr>
      <tr>
        <td>E</td>
        <td>F</td>
      </tr>
      <tr>
        <td>G</td>
        <td>H</td>
      </tr>
    </table>

    <h2>Second Table</h2>
    <table>
      <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
        <td>5</td>
        <td>6</td>
      </tr>
      <tr>
        <td>2</td>
        <td>3</td>
        <td>4</td>
        <td>5</td>
        <td>6</td>
        <td>7</td>
      </tr>
      <tr>
        <td>3</td>
        <td>4</td>
        <td>5</td>
        <td>6</td>
        <td>7</td>
        <td>8</td>
      </tr>
      <tr>
        <td>4</td>
        <td>5</td>
        <td>6</td>
        <td>7</td>
        <td>8</td>
        <td>9</td>
      </tr>
      <tr>
        <td>5</td>
        <td>6</td>
        <td>7</td>
        <td>8</td>
        <td>9</td>
        <td>10</td>
      </tr>
      <tr>
        <td>6</td>
        <td>7</td>
        <td>8</td>
        <td>9</td>
        <td>10</td>
        <td>11</td>
      </tr>
    </table>
    </table>
  </body>
</html>

It prints the output rows like this: [['A', 'B'], ['C', 'D'], ['E', 'F'], ['G', 'H']]

And the CSV file contains only these rows.

The correct output should look like this:

[['A', 'B'], ['C', 'D'], ['E', 'F'], ['G', 'H']]
[['1', '2', '3', '4', '5', '6'],
 ['2', '3', '4', '5', '6', '7'],
 ['3', '4', '5', '6', '7', '8'],
 ['4', '5', '6', '7', '8', '9'],
 ['5', '6', '7', '8', '9', '10'],
 ['6', '7', '8', '9', '10', '11']]

Both of these arrays should be written to the CSV file.

How can I fix my code using the BeautifulSoup and csv modules?

Upvotes: 0

Views: 969

Answers (1)

KunduK

Reputation: 33384

This is because you used find(), which returns only the first match. You need to use find_all() to get all the tables. Try this:

from bs4 import BeautifulSoup
data='''<html>
  <head>
    <title>Test Table</title>
  </head>
  <body>
    <h2>First Table</h2>
    <table>
      <tr>
        <td>A</td>
        <td>B</td>
      </tr>
      <tr>
        <td>C</td>
        <td>D</td>
      </tr>
      <tr>
        <td>E</td>
        <td>F</td>
      </tr>
      <tr>
        <td>G</td>
        <td>H</td>
      </tr>
    </table>

    <h2>Second Table</h2>
    <table>
      <tr>
        <td>1</td>
        <td>2</td>
        <td>3</td>
        <td>4</td>
        <td>5</td>
        <td>6</td>
      </tr>
      <tr>
        <td>2</td>
        <td>3</td>
        <td>4</td>
        <td>5</td>
        <td>6</td>
        <td>7</td>
      </tr>
      <tr>
        <td>3</td>
        <td>4</td>
        <td>5</td>
        <td>6</td>
        <td>7</td>
        <td>8</td>
      </tr>
      <tr>
        <td>4</td>
        <td>5</td>
        <td>6</td>
        <td>7</td>
        <td>8</td>
        <td>9</td>
      </tr>
      <tr>
        <td>5</td>
        <td>6</td>
        <td>7</td>
        <td>8</td>
        <td>9</td>
        <td>10</td>
      </tr>
      <tr>
        <td>6</td>
        <td>7</td>
        <td>8</td>
        <td>9</td>
        <td>10</td>
        <td>11</td>
      </tr>
    </table>
    </table>
  </body>
</html>'''

soup=BeautifulSoup(data,'html.parser')
tables = soup.find_all("table")

output_rows = []
for table in tables:
    for table_row in table.findAll('tr'):
        columns = table_row.findAll('td')
        output_row = []
        for column in columns:
            output_row.append(column.text)
        output_rows.append(output_row)
print(output_rows)

Output:

[['A', 'B'], ['C', 'D'], ['E', 'F'], ['G', 'H'], ['1', '2', '3', '4', '5', '6'], ['2', '3', '4', '5', '6', '7'], ['3', '4', '5', '6', '7', '8'], ['4', '5', '6', '7', '8', '9'], ['5', '6', '7', '8', '9', '10'], ['6', '7', '8', '9', '10', '11']]

Updated code, which keeps each table's rows in a separate list:

soup=BeautifulSoup(data,'html.parser')
tables = soup.find_all("table")
output_final_rows=[]

for table in tables:
    output_rows = []
    for table_row in table.findAll('tr'):
        columns = table_row.findAll('td')
        output_row = []
        for column in columns:
            output_row.append(column.text)
        output_rows.append(output_row)
    output_final_rows.append(output_rows)

print(output_final_rows)
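
The question also asks for both tables to end up in the CSV file, which the code above does not do yet. A minimal sketch of one way to do it, assuming a nested list shaped like output_final_rows (one list of rows per table): the helper name write_tables_to_csv, the blank separator row between tables, and the use of "w" mode rather than the question's "a" are my assumptions, not part of the original code.

```python
import csv


def write_tables_to_csv(tables, path):
    """Write every table's rows to one CSV file (hypothetical helper).

    `tables` is a list of tables, each table being a list of rows,
    i.e. the same shape as output_final_rows.
    """
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile)
        for index, rows in enumerate(tables):
            if index > 0:
                # An empty row separates consecutive tables (a design choice)
                writer.writerow([])
            writer.writerows(rows)


if __name__ == "__main__":
    # Shortened sample data in the same shape as output_final_rows
    example_tables = [
        [["A", "B"], ["C", "D"]],
        [["1", "2", "3"], ["2", "3", "4"]],
    ]
    write_tables_to_csv(example_tables, "output.csv")
```

With the answer's variables this would be called as write_tables_to_csv(output_final_rows, "output.csv"). If you would rather have one file per table, you could open a separate file inside the loop instead of writing the separator row.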

Upvotes: 1
