okeyla

Reputation: 65

Select a table without attributes using BeautifulSoup

Can BeautifulSoup select a table that has no attributes? There are many tables in the HTML, but the data I want is in the table without any attributes.

Here is my example. There are two tables in the HTML: one contains English letters and the other contains numbers.

from bs4 import BeautifulSoup

HTML2 = """
<table>
    <tr>
        <td class>a</td>
        <td class>b</td>
        <td class>c</td>
        <td class>d</td>
    </tr>
    <tr>
        <td class>e</td>
        <td class>f</td>
        <td class>g</td>
        <td class>h</td>
    </tr>
</table>

<table cellpadding="0">
    <tr>
        <td class>111</td>
        <td class>222</td>
        <td class>333</td>
        <td class>444</td>
    </tr>
    <tr>
        <td class>555</td>
        <td class>666</td>
        <td class>777</td>
        <td class>888</td>
    </tr>
</table>
"""
soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.select('table[cellpadding!="0"]') #<---I think the key point is here.
for div in f2:
    row = ''
    rows = div.findAll('tr')
    for row in rows:
        if(row.text.find('td') != False):
            print(row.text)

I only want the data in the "english" table, formatted like this:

a b c d
e f g h

Then I want to save it to Excel.

But I can only access the "number" table. Any hints? Thanks!

Upvotes: 0

Views: 2166

Answers (2)

t.m.adam

Reputation: 15376

You could use find_all and select only tables that don't have a specific attribute.

f2 = soup2.find_all('table', {'cellpadding':None})

Or if you want to select tables that have absolutely no attributes:

f2 = [tbl for tbl in soup2.find_all('table') if not tbl.attrs]
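
As a minimal, self-contained sketch of both filters side by side (the shortened HTML and the names `no_cellpadding` and `no_attrs` are illustrative assumptions, not part of the question):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (illustrative only)
html = """
<table>
    <tr><td>a</td><td>b</td></tr>
</table>
<table cellpadding="0">
    <tr><td>111</td><td>222</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Filter 1: tables that lack one specific attribute
no_cellpadding = soup.find_all('table', {'cellpadding': None})

# Filter 2: tables that carry no attributes at all
no_attrs = [tbl for tbl in soup.find_all('table') if not tbl.attrs]

for tbl in no_attrs:
    print([td.text for td in tbl.find_all('td')])
```

The second filter is stricter: a table with, say, only an `id` attribute would pass the first filter but not the second.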


Then you can build a list of rows from f2 and pass it to a pandas DataFrame.

data = [ 
    [td.text for td in tr.find_all('td')] 
    for table in f2 for tr in table.find_all('tr') 
]
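
One way to carry this through to Excel is sketched below. It assumes pandas is installed and that an Excel writer such as openpyxl is available for `to_excel`. The helper `save_rows` and the file name `english_table.xlsx` are hypothetical names chosen for illustration.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (illustrative only)
html = """
<table>
    <tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
    <tr><td>e</td><td>f</td><td>g</td><td>h</td></tr>
</table>
<table cellpadding="0">
    <tr><td>111</td><td>222</td><td>333</td><td>444</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
f2 = [tbl for tbl in soup.find_all('table') if not tbl.attrs]

# One inner list per <tr>, one string per <td>
data = [
    [td.text for td in tr.find_all('td')]
    for table in f2 for tr in table.find_all('tr')
]

# Each inner list becomes one DataFrame row
df = pd.DataFrame(data)
print(df.to_string(index=False, header=False))


def save_rows(frame, path='english_table.xlsx'):
    """Hypothetical helper: write the rows without index or header."""
    frame.to_excel(path, index=False, header=False)
```

Calling `save_rows(df)` would then write the two rows to the spreadsheet.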

Upvotes: 2

htn

Reputation: 301

You can use the has_attr method to test whether a table has the cellpadding attribute:

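soup2 = BeautifulSoup(HTML2, 'html.parser')
f2 = soup2.find_all('table')
for table in f2:
    # Skip any table that carries a cellpadding attribute
    if not table.has_attr('cellpadding'):
        for row in table.find_all('tr'):
            cells = row.find_all('td')
            if cells:  # ignore rows that contain no <td> cells
                # Join the cell texts with spaces, e.g. "a b c d"
                print(' '.join(cell.get_text(strip=True) for cell in cells))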
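
The same test can probably also be expressed as a CSS selector. The sketch below assumes a Beautiful Soup version whose `select` supports the `:not()` pseudo-class (4.7 or later, which uses Soup Sieve). The shortened HTML and the name `lines` are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's HTML (illustrative only)
html = """
<table>
    <tr><td>a</td><td>b</td><td>c</td><td>d</td></tr>
    <tr><td>e</td><td>f</td><td>g</td><td>h</td></tr>
</table>
<table cellpadding="0">
    <tr><td>111</td><td>222</td><td>333</td><td>444</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# :not([cellpadding]) keeps only tables without a cellpadding attribute
tables = soup.select('table:not([cellpadding])')

# One space-separated line per table row
lines = [
    ' '.join(td.get_text(strip=True) for td in tr.find_all('td'))
    for table in tables for tr in table.find_all('tr')
]
print('\n'.join(lines))
```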

Upvotes: 1
