Mars Lee

Reputation: 1945

How to deal with the td tag whose colspan == ''?

I'm trying to extract data from HTML source using BeautifulSoup. This is the source:

<td class="advisor" colspan="">

Here is my code:

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')

for td in tds:
    if td["colspan"] == '':
        col = 0
    else:
        col = int(td["colspan"])

However, I get this error:

ValueError: invalid literal for int() with base 10: ''

I know this error means '' cannot be converted to an integer, but why doesn't my 'if' catch it? I expected this case to reach

col = 0

rather than

col = int(td["colspan"])

Upvotes: 1

Views: 1813

Answers (3)

Martin Evans

Reputation: 46759

I would suggest you use exception handling as follows:

from bs4 import BeautifulSoup

html = """
    <td class="advisor" colspan="2"></td>
    <td class="advisor" colspan=""></td>
    <td class="advisor" colspan="x"></td>
    """

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')

for td in tds:
    try:
        col = int(td["colspan"])
    except (ValueError, KeyError) as e:
        col = 0

    print(col)

This would display the following:

2
0
0

Tested using Python 3.4.3
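If the same conversion is needed in several places, the try/except above can be wrapped in a small helper. This is only a sketch: the name colspan_or_default is hypothetical, and plain dicts stand in for the parsed tags so the snippet runs without bs4 (a BeautifulSoup Tag also provides .get(), so passing td should work the same way).

```python
def colspan_or_default(attrs, default=0):
    """Return attrs['colspan'] as an int, or `default` if it is missing or not numeric."""
    raw = attrs.get("colspan")  # None when the attribute is absent
    try:
        return int(raw)
    except (TypeError, ValueError):
        # TypeError: raw is None; ValueError: '' or other non-numeric text
        return default


# Hypothetical stand-ins for <td> tags: valid, empty, non-numeric, and missing colspan.
cells = [{"colspan": "2"}, {"colspan": ""}, {"colspan": "x"}, {}]
print([colspan_or_default(cell) for cell in cells])
```

Using .get() instead of td["colspan"] turns a missing attribute into None, so a single except clause covers both the missing and the malformed case.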

Upvotes: 2

Mauro Baraldi

Reputation: 6575

You can default col to 0, then overwrite it only when the colspan value is made up of digits:

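from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')

for td in tds:
    col = 0  # default when colspan is missing, empty, or non-numeric
    # .get() avoids a KeyError when the attribute is absent
    if td.get("colspan", "").isdigit():
        col = int(td["colspan"])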
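One caveat with the str.isdigit() check: it also returns True for some non-ASCII characters, such as the superscript '²', which int() rejects with a ValueError. A stricter variant could look like the sketch below. The helper name digits_to_int is hypothetical, and str.isascii() assumes Python 3.7 or later.

```python
def digits_to_int(text, default=0):
    """Convert text to int only when it consists solely of ASCII digits 0-9."""
    # isascii() filters out characters like '²' that pass isdigit() but fail int().
    if text.isascii() and text.isdigit():
        return int(text)
    return default


for value in ["2", "", "x", "²"]:
    print(repr(value), digits_to_int(value))
```

Like the loop above, this treats signed values such as "-1" as invalid, which is reasonable for colspan.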

Upvotes: 0

Ian

Reputation: 30813

To avoid an error caused by invalid input, you can check whether the value really is an integer before converting it:

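def check_int(s):
    # Reject None and the empty string up front.
    if s is None or s == '':
        return False
    st = str(s)
    # Allow a single leading sign, then require digits only.
    if st[0] in ('-', '+'):
        return st[1:].isdigit()
    return st.isdigit()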

for td in tds:
    if check_int(td["colspan"]):
        col = int(td["colspan"])
    else:
        col = 0

Or, using a conditional (ternary) expression:

for td in tds:
    col = int(td["colspan"]) if check_int(td["colspan"]) else 0

Edit: some good materials to do int checking without try-except.

Upvotes: 1
