Mehmet Balioglu

Reputation: 2302

Scraping table header with Beautiful Soup

I am trying to scrape a table:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>

    <table class="table ajax">
    <thead>
        <tr>
            <th scope="col">
                <span>NO.</span>
            </th>
            <th scope="col" data-index="1">
                    <span>Year of initiation</span>
                              
            </th>
            <th scope="col" data-index="2">
                
                    <span>Short case name</span>
                    
                
            </th>
            <th scope="col" data-index="3" style="display: none;">
                
                    <span>Full case name</span>
                    
                
            </th>
            <th scope="col" data-index="4">
               
                    <span>Applicable IIA</span>
                    
                
        </tr>
    </thead>
    <tbody>
            <tr>
                <th scope="row">1</th>
                <td data-index="1">
                    2019
                </td>
                <td data-index="2">
                   Alcosa v. Kuwait</a>
                </td>
                <td data-index="3" style="display: none;">
                    Alcosa v. The State of Kuwait
                </td>
                <td data-index="4">
Kuwait - Spain BIT(2005)</a>                </td>
                <td data-index="5"> UNCITRAL
               </td>
</tbody>
</table>

</body>
</html>

with the following code:

html = driver.page_source
bs=BeautifulSoup(html, "lxml")
table = bs.find('table', { 'class' : 'ajax' })
table_body=table.find('tbody')
rows = table_body.findAll('tr')

with open('son.csv', "wt+") as f:
    writer = csv.writer(f)
    for row in rows:
        cols = row.find_all('td')
        cols = [x.get_text(strip=True, separator='|') for x in cols]
        writer.writerow(cols)

I can get the table rows, but I can't get the table header.

This is the output I want to get:

NO. Year of initiation  Short case name Applicable IIA
1   2019    Alcosa v. Kuwait    Kuwait - Spain BIT(2005)    UNCITRAL

How can I do it? Thanks.

Upvotes: 1

Views: 881

Answers (1)

Andrej Kesely

Reputation: 195438

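Your code only searches inside tbody and only collects td cells, so it never sees the th cells in thead (or the row-number th in each body row). Selecting th as well as td picks up both. You can try this script to save the table to CSV: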

import csv
from bs4 import BeautifulSoup    


txt = '''<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>

    <table class="table ajax">
    <thead>
        <tr>
            <th scope="col">
                <span>NO.</span>
            </th>
            <th scope="col" data-index="1">
                    <span>Year of initiation</span>
                              
            </th>
            <th scope="col" data-index="2">
                
                    <span>Short case name</span>
                    
                
            </th>
            <th scope="col" data-index="3" style="display: none;">
                
                    <span>Full case name</span>
            </th>
            <th scope="col" data-index="4">               
                    <span>Applicable IIA</span>
                    
             </th>   
        </tr>
    </thead>
    <tbody>
            <tr>
                <th scope="row">1</th>
                <td data-index="1">
                    2019
                </td>
                <td data-index="2">
                   Alcosa v. Kuwait
                </td>
                <td data-index="3" style="display: none;">
                    Alcosa v. The State of Kuwait
                </td>
                <td data-index="4">
                    Kuwait - Spain BIT(2005)
                </td>
                <td data-index="5"> UNCITRAL
               </td>
            </tr>
</tbody>
</table>

</body>
</html>'''


soup = BeautifulSoup(txt, 'html.parser')

headers = [th.get_text(strip=True) for th in soup.select('table.ajax thead th')]
rows = []
for row in soup.select('table.ajax tbody tr'):
    data = [d.get_text(strip=True) for d in row.select('th, td')]
    rows.append(data)
   
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('son.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(rows)

This writes son.csv with the header row first, followed by the data rows.
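Note that the script above also exports the hidden "Full case name" column, which the desired output in the question leaves out. If you want to skip it, one possible approach is to filter out cells whose inline style contains display: none. The sketch below uses a trimmed copy of the question's table and assumes hidden cells are always marked with an inline style attribute (not a CSS class); the helper names is_visible and visible_text are made up for this example. It writes to an in-memory buffer so it runs without touching the disk; swap in open('son.csv', 'w', newline='') to write a file.

```python
import csv
import io
import re

from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, including one hidden column.
html = '''
<table class="table ajax">
  <thead>
    <tr>
      <th scope="col"><span>NO.</span></th>
      <th scope="col" data-index="1"><span>Year of initiation</span></th>
      <th scope="col" data-index="3" style="display: none;"><span>Full case name</span></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th scope="row">1</th>
      <td data-index="1"> 2019 </td>
      <td data-index="3" style="display: none;"> Alcosa v. The State of Kuwait </td>
    </tr>
  </tbody>
</table>
'''

# Assumption: hidden cells carry an inline "display: none" style.
HIDDEN = re.compile(r'display\s*:\s*none')


def is_visible(cell):
    """Return True unless the cell's inline style hides it."""
    return not HIDDEN.search(cell.get('style', ''))


def visible_text(cells):
    """Stripped text of every visible cell, in document order."""
    return [c.get_text(strip=True) for c in cells if is_visible(c)]


soup = BeautifulSoup(html, 'html.parser')
headers = visible_text(soup.select('table.ajax thead th'))
rows = [visible_text(tr.select('th, td'))
        for tr in soup.select('table.ajax tbody tr')]

# Write to an in-memory buffer instead of a file for demonstration.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(headers)
writer.writerows(rows)
print(buf.getvalue())
```

Because the same filter is applied to header and body cells, the columns stay aligned as long as the page hides a column consistently in both places.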

Upvotes: 1
