BeautifulSoup: Extract Stripped Text without Tags

Question

I'm trying to parse content from a table on a site and print only the text of each node. I'm using .text.strip(), but it doesn't work correctly.

My code:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://examplesite.net')
soup = BeautifulSoup(r.content, 'lxml')


builddata = soup.find('table', {'id':'BuildData'})

table_elements = builddata.find_all('tr')
for element in table_elements:
    element_dict = {'element_name':element.findChildren()[0].text.strip(), 'element_value':element.findChildren()[1].text.strip()}
    print(element_dict)

Result:

{'element_value': 'Студия;                                                 1-к кв;                                                 2-к кв;                                                 3-к кв;                                                 4-к кв', 
{'element_value': 'Квартира у воды,     		       		Зеленая зона', 'element_name': 'Особенности:'}

These are the problem lines. They should look like this:

{'element_value': 'Студия; 1-к кв; 2-к кв; 3-к кв; 4-к кв', 
{'element_value': 'Квартира у воды, Зеленая зона', 'element_name': 'Особенности:'}

What am I doing wrong?

alecxe · Accepted Answer

You should be using get_text() with strip=True:

for element in table_elements:
    name, value = element.find_all("td")[:2]

    element_dict = {
        'element_name': name.get_text(strip=True),
        'element_value': ' '.join(value.get_text(strip=True, separator=" ").split())
    }
    print(element_dict)

Also, note how the code above reads the cell values: it uses find_all() instead of findChildren() and unpacks the first two cells into name and value.

Note that one of the values needs to be handled "manually": the "Цена за кв. метр:" ("Price per sq. meter:") value contains runs of multiple spaces, which we replace with a single one. That is what the ' '.join(... .split()) part does.

Prints:

{'element_name': 'Район:', 'element_value': 'САО (МСК)'}
{'element_name': 'Метро:', 'element_value': 'Речной Вокзал , Петровско-Разумовская'}
{'element_name': 'До метро:', 'element_value': '5.9 км (18 мин на машине) (Посмотреть маршрут)'}
{'element_name': 'Адрес:', 'element_value': 'Дмитровское шоссе, 107 (Посмотреть на карте)'}
...
{'element_name': 'Разрешение на строительство:', 'element_value': 'Есть'}
{'element_name': 'Обновлено:', 'element_value': '19 Декабря 2016'}
{'element_name': 'Особенности:', 'element_value': 'Квартира у воды , Зеленая зона'}
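To see why .text.strip() alone is not enough, here is a minimal sketch on a made-up table row. The markup and the English cell values are assumptions for illustration, not the real page, and 'html.parser' is used only so the snippet runs without lxml. strip() removes whitespace at the two ends of the string only, so newlines and indentation between child nodes survive.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one row of the table from the question.
html = """
<table id="BuildData">
  <tr>
    <td>Features:</td>
    <td>
      <a href="#">Waterfront apartment</a>,
      <a href="#">Green zone</a>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
name, value = soup.find("tr").find_all("td")[:2]

# .text.strip() trims only the ends; the inner newline and indentation remain.
raw = value.text.strip()

# The approach from the answer: strip every text fragment, join the fragments
# with a space, then collapse any remaining whitespace runs.
normalized = " ".join(value.get_text(separator=" ", strip=True).split())

# An alternative: collapse whitespace in the raw text without a separator.
collapsed = " ".join(value.text.split())

print(repr(raw))
print(normalized)  # Waterfront apartment , Green zone
print(collapsed)   # Waterfront apartment, Green zone
```

The separator=" " variant puts a space between every text fragment, which is why a space appears before the comma in the output above. Collapsing the raw text instead keeps the original punctuation spacing, but it would glue together adjacent fragments that have no whitespace between them in the markup.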

As a side note, if you are going to deal with more tabular HTML, see whether loading the tables into pandas.DataFrame objects with pandas.read_html() would be more convenient than parsing them manually with BeautifulSoup.
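A rough sketch of the pandas route, again on made-up markup (the table contents and the column names assigned below are assumptions). It needs pandas plus an HTML parser backend such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-column table with the same id as in the question.
html = """
<table id="BuildData">
  <tr><td>District:</td><td>North</td></tr>
  <tr><td>Updated:</td><td>19 December 2016</td></tr>
</table>
"""

# read_html() returns a list of DataFrames, one per matching <table>.
tables = pd.read_html(StringIO(html), attrs={"id": "BuildData"})
df = tables[0]

# The table has no header row, so name the columns ourselves.
df.columns = ["element_name", "element_value"]

records = df.to_dict("records")
for record in records:
    print(record)
```

For a real page, the HTML text fetched with requests (for example r.text) could be wrapped in StringIO in the same way.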
