Borja_042

Reputation: 1071

Parsing different bs4.element.Tag objects with BeautifulSoup

I want to parse the table in this url and export it as a csv:

http://www.bde.es/webbde/es/estadis/fi/ifs_es.html

If I do this:

from urllib.request import urlopen
import bs4 as bs
sauce = urlopen(url_bank).read()
soup = bs.BeautifulSoup(sauce, 'html.parser')

and then this:

resto = soup.find_all('td')
lista_text = []
for elements in resto:
    lista_text = lista_text + [elements.string]

I get all the elements parsed correctly except the last column, 'Códigos Isin', because there is a line break '<br/>' in the HTML code of those cells. I do not know how to handle it. I have tried this, but it still does not work:

lista_text = lista_text + [str(elements.string).replace('<br/>','')]

After that I convert the list to an np.array and then to a DataFrame to export it as a .csv file. That part is already done; I only need to fix this issue.

Thanks in advance!

Upvotes: 3

Views: 3677

Answers (1)

alecxe

Reputation: 473873

You need to be careful about what .string does: if a tag has more than one child, it returns None, which is what happens in the cells that contain <br>. As the documentation puts it:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None

Use .get_text() instead:

for elements in resto:
    lista_text = lista_text + [elements.get_text(strip=True)]
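
To see the difference in isolation, here is a minimal sketch. The HTML snippet and the ISIN-like codes in it are made up for illustration and are not taken from the actual page; the only assumption is that a cell in the last column looks roughly like two codes separated by a <br/>.

```python
from bs4 import BeautifulSoup

# Hypothetical row standing in for the real table; the values are invented.
html = "<table><tr><td>Banco A</td><td>ES0000000001<br/>ES0000000002</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

cells = soup.find_all("td")

# .string is None for the second cell because it has three children:
# a text node, the <br/> tag, and another text node.
strings = [td.string for td in cells]

# get_text() joins every text node inside the tag; the separator keeps
# the two codes from running together.
texts = [td.get_text(separator=" ", strip=True) for td in cells]

print(strings)
print(texts)
```

Note that get_text(strip=True) without a separator would glue the two codes into one string, so passing separator=" " (or any delimiter that suits the CSV) is probably what you want for that column.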

Upvotes: 4
