Konstantin Rusanov

Reputation: 6554

BeautifulSoup Extract stripped Text without Tags

I'm trying to parse content from a table on a site and print only the text of each node. I'm using .text.strip(), but it doesn't work correctly.

My code:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://examplesite.net')
soup = BeautifulSoup(r.content, 'lxml')


builddata = soup.find('table', {'id':'BuildData'})

table_elements = builddata.find_all('tr')
for element in table_elements:
    element_dict = {'element_name':element.findChildren()[0].text.strip(), 'element_value':element.findChildren()[1].text.strip()}
    print(element_dict)

Result:

{'element_value': 'Студия;                                                 1-к кв;                                                 2-к кв;                                                 3-к кв;                                                 4-к кв', 
{'element_value': 'Квартира у воды,     \t\t       \t\tЗеленая зона', 'element_name': 'Особенности:'}

The problem lines should look like this:

{'element_value': 'Студия; 1-к кв; 2-к кв; 3-к кв; 4-к кв', 
{'element_value': 'Квартира у воды, Зеленая зона', 'element_name': 'Особенности:'}

What am I doing wrong?

Upvotes: 2

Views: 2275

Answers (2)

alecxe

Reputation: 474021

You should be using get_text() with strip=True:

for element in table_elements:
    name, value = element.find_all("td")[:2]

    element_dict = {
        'element_name': name.get_text(strip=True),
        'element_value': ' '.join(value.get_text(strip=True, separator=" ").split())
    }
    print(element_dict)

Also, note how I read the cell values in the code above: find_all() is used instead of findChildren(), and the first two cells are unpacked into name and value.

Note that one of the values has to be handled "manually": the "Цена за кв. метр:" value contains runs of multiple spaces, which we can replace with a single space.

Prints:

{'element_name': 'Район:', 'element_value': 'САО (МСК)'}
{'element_name': 'Метро:', 'element_value': 'Речной Вокзал , Петровско-Разумовская'}
{'element_name': 'До метро:', 'element_value': '5.9 км (18 мин на машине) (Посмотреть маршрут)'}
{'element_name': 'Адрес:', 'element_value': 'Дмитровское шоссе, 107 (Посмотреть на карте)'}
...
{'element_name': 'Разрешение на строительство:', 'element_value': 'Есть'}
{'element_name': 'Обновлено:', 'element_value': '19 Декабря 2016'}
{'element_name': 'Особенности:', 'element_value': 'Квартира у воды , Зеленая зона'}
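
To experiment without fetching the page, here is a minimal, self-contained sketch of the same approach. The HTML snippet, its English cell values, and the normalize() helper are made up for illustration (the real page markup is not shown in the question), and html.parser is used only so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the table from the question.
HTML = """
<table id="BuildData">
  <tr>
    <td>Rooms:</td>
    <td>
      Studio;                1-room;
                             2-room
    </td>
  </tr>
  <tr>
    <td>Features:</td>
    <td><a href="#">Waterfront</a>,     \t\t   <a href="#">Green zone</a></td>
  </tr>
</table>
"""


def normalize(cell):
    """Return the cell text with every whitespace run collapsed to one space."""
    # get_text(strip=True) trims each text node; separator=" " keeps
    # neighbouring nodes apart; split()/join() collapses inner whitespace.
    return " ".join(cell.get_text(separator=" ", strip=True).split())


soup = BeautifulSoup(HTML, "html.parser")

rows = []
for tr in soup.find("table", {"id": "BuildData"}).find_all("tr"):
    name, value = tr.find_all("td")[:2]
    rows.append({
        "element_name": normalize(name),
        "element_value": normalize(value),
    })

for row in rows:
    print(row)
```

With this sample markup the second row comes out as 'Waterfront , Green zone': the space before the comma appears because the comma is a separate text node joined with separator=" ", the same effect visible in the output above.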

As a side note, if you will be parsing more tabular HTML, see whether loading the tables into pandas.DataFrame objects with pandas.read_html() would be more convenient than parsing them manually with BeautifulSoup.

Upvotes: 3

brianpck

Reputation: 8254

strip() only removes leading and trailing whitespace; whitespace inside the string is left untouched:

>>> '      test     test         '.strip()
'test     test'

To replace each run of whitespace characters with a single space, as you appear to want in your example, you can do something like the following:

>>> ' '.join('abc                 adsfdf                adsfsaf'.split())
'abc adsfdf adsfsaf'
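
As a rough sketch, the idiom can be wrapped in a small helper and compared with plain strip(). The function names and the sample string below are illustrative only and are not part of the question's code; the regex variant is just an equivalent alternative.

```python
import re


def collapse_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    # split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so no separate strip() is needed.
    return " ".join(text.split())


def collapse_whitespace_re(text):
    """Same result using a regular expression."""
    return re.sub(r"\s+", " ", text).strip()


# Illustrative input with tabs and spaces inside the string.
raw = "  Waterfront,     \t\t       \t\tGreen zone  "

stripped = raw.strip()                 # inner whitespace is still there
collapsed = collapse_whitespace(raw)   # inner runs reduced to one space

print(repr(stripped))
print(repr(collapsed))
```

Here strip() leaves the tabs and spaces between the two phrases in place, while both helpers return 'Waterfront, Green zone'.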

Upvotes: 0
