Reputation: 1945
I'm trying to extract data from HTML source using BeautifulSoup. This is the source:
<td class="advisor" colspan="">
Here is my code:
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')
for td in tds:
    if td["colspan"] == '':
        col = 0
    else:
        col = int(td["colspan"])
However, I get this error:
ValueError: invalid literal for int() with base 10: ''
I know this error means '' cannot be converted to an integer, but why doesn't my 'if' work? I think this case should reach
col = 0
rather than
col = int(td["colspan"])
Upvotes: 1
Views: 1813
Reputation: 46759
I would suggest you use exception handling as follows:
from bs4 import BeautifulSoup

html = """
<td class="advisor" colspan="2"></td>
<td class="advisor" colspan=""></td>
<td class="advisor" colspan="x"></td>
"""

soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')
for td in tds:
    try:
        col = int(td["colspan"])
    except (ValueError, KeyError) as e:
        col = 0
    print(col)
This would display the following:
2
0
0
Tested using Python 3.4.3
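The same idea can be pulled into a small helper so the loop body stays short. This is only a sketch: `to_colspan` is a hypothetical name, and it assumes you pass it the raw attribute value, for example from `td.get("colspan")`, which returns None when the attribute is missing (so the helper catches TypeError instead of KeyError).

```python
def to_colspan(raw, default=0):
    """Return raw converted to int, or default if it is missing or not numeric."""
    try:
        return int(raw)
    except (TypeError, ValueError):
        # TypeError: raw is None (attribute absent)
        # ValueError: raw is '' or some other non-numeric string
        return default


# Stand-ins for the values td.get("colspan") might return
for raw in ["2", "", "x", None]:
    print(to_colspan(raw))
```

With the four sample values this prints 2, 0, 0, 0.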
Upvotes: 2
Reputation: 6575
You can assume that col defaults to 0, then overwrite it only when the colspan value consists of digits.
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all('td')
for td in tds:
    col = 0
    if td["colspan"].isdigit():
        col = int(td["colspan"])
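Two edge cases are worth keeping in mind with this approach: `td["colspan"]` raises KeyError when the attribute is absent, and `str.isdigit()` accepts a few characters that `int()` rejects (a superscript "²", for instance). A hedged variant, using a hypothetical helper name `digits_or_default` and assuming the raw value comes from something like `td.get("colspan")`, could look like this:

```python
def digits_or_default(raw, default=0):
    """Return raw as an int if it is a plain decimal string, else default."""
    # None stands for a missing attribute; strip() tolerates stray whitespace.
    text = (raw or "").strip()
    # isdecimal() is stricter than isdigit(): "²".isdigit() is True,
    # but int("²") raises ValueError.
    return int(text) if text.isdecimal() else default


for raw in ["3", "", None, " 4 ", "x"]:
    print(digits_or_default(raw))
```

Note that a signed string such as "-1" falls back to the default here, which seems acceptable for colspan but may not suit other attributes.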
Upvotes: 0
Reputation: 30813
To avoid an error caused by invalid input, you could check whether the argument really represents an integer before you proceed:
def check_int(s):
    # Reject missing or empty values up front
    if s is None or s == '':
        return False
    st = str(s)
    # Allow a single leading sign, then require digits only
    if st[0] in ('-', '+'):
        return st[1:].isdigit()
    return st.isdigit()

for td in tds:
    if check_int(td["colspan"]):
        col = int(td["colspan"])
    else:
        col = 0
Or, using a conditional expression:
for td in tds:
    col = int(td["colspan"]) if check_int(td["colspan"]) else 0
Edit: some good materials to do int checking without try-except.
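If looking up the attribute twice feels clumsy, the check and the conversion can be folded into one function. This is a minimal sketch rather than a definitive version: `int_or_default` is a hypothetical name, and it applies the same "optional sign followed by digits" rule described above.

```python
def int_or_default(s, default=0):
    """Convert s to int if it looks like an optionally signed integer."""
    if s is None:
        return default
    st = str(s)
    # Drop one leading sign before testing the remaining characters
    body = st[1:] if st[:1] in ("-", "+") else st
    # isdecimal() only accepts characters that int() can also parse
    return int(st) if body.isdecimal() else default


for value in ["2", "-3", "+5", "", "x", "+", None]:
    print(int_or_default(value))
```

In the loop this would be used as `col = int_or_default(td["colspan"])`, assuming every td carries the attribute.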
Upvotes: 1