Reputation: 4010

BS4 How to get text without using .text?

This is the source code layout from the website:

<div class="address">
    <a href="https://website.ca/classifieds/59-barclay-street/">
        59 Some Street<br />City, Zone 1
    </a>
</div>

I would like to get the street number, route, and city for Google Geocoding. If I do this:

>>> article.find('div', {'class': 'address'}).text
'59 Some StreetCity, Zone 1'

It takes away the <br /> and I'm left with no way to split the route from the city. If I do str().replace('<br />', ', '), I then have to convert the result back to whatever type it was before so that I can call .text to get the actual text inside the <a href>, which is inefficient. I'd like to use the functionality that .text uses to get the actual text, but without the part that removes the <br>. I couldn't find a file called BeautifulSoup.py in my env, so I'm looking at the BeautifulSoup source code on GitHub, but I can't find a def text in there and I don't know where else to look.

Update:

articles = page_soup.find('h2', text='Ads').find_next_siblings('article')
for article in articles:
    link = article.find('a')
    br = link.find('br')
    ad_address = br.previous_sibling.strip() + ', ' + br.next_sibling.strip().partition(', Zone ')[0]
    #ad_address = link.br.replace_with(', ').get_text().strip().partition(', Zone ')

Upvotes: 1

Answers (3)

Harry1992

Reputation: 469

Try:

for link_to_text in links:
    print(link_to_text.get_text())

Upvotes: 0

alecxe

Reputation: 474201

You can locate the br delimiter tag and get the siblings around it:

In [4]: br = soup.select_one("div.address > a > br")

In [5]: br.previous_sibling.strip()
Out[5]: u'59 Some Street'

In [6]: br.next_sibling.strip()
Out[6]: u'City, Zone 1'

You may also locate the br element and replace it with a space using replace_with():

In [4]: a = soup.select_one("div.address > a")
In [5]: a.br.replace_with(" ")

In [6]: a.get_text().strip()
Out[6]: u'59 Some Street City, Zone 1'

Or, you can join all text nodes inside the a tag:

In [7]: a = soup.select_one("div.address > a")
In [8]: " ".join(a.find_all(text=True)).strip()
Out[8]: u'59 Some Street City, Zone 1'
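
For reference, here is the sibling approach as a minimal, self-contained sketch. It assumes the markup from the question and the built-in "html.parser" backend. The helper name split_address is made up for illustration, and the ", Zone " suffix handling is taken from the question's update rather than from anything Beautiful Soup provides:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div class="address">
    <a href="https://website.ca/classifieds/59-barclay-street/">
        59 Some Street<br />City, Zone 1
    </a>
</div>
"""


def split_address(soup):
    """Return (street, city) from the text nodes on either side of the <br>."""
    # split_address is a hypothetical helper, not part of Beautiful Soup.
    br = soup.select_one("div.address > a > br")
    street = br.previous_sibling.strip()
    # Drop the ", Zone N" suffix, as in the question's update.
    city = br.next_sibling.strip().partition(", Zone ")[0]
    return street, city


soup = BeautifulSoup(HTML, "html.parser")
street, city = split_address(soup)
print(f"{street}, {city}")
```

If a listing might lack the br tag, select_one() returns None, so a real scraper would probably want to check for that before touching the siblings.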

Upvotes: 4

rask004

Reputation: 590

Try:

soup.find('div', {'class':'address'}).get_text(separator=u"<br/>").split(u'<br/>')

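The separator keyword is the string that get_text() inserts between adjacent text nodes when it joins them, so splitting on that same string recovers the separate pieces.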

http://omz-software.com/pythonista/docs/ios/beautifulsoup_ref.html
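
A small runnable sketch of the same idea, assuming the markup from the question and the built-in "html.parser" backend. It swaps in a newline as the separator and passes strip=True so the indentation around each piece is trimmed. The stripped_strings generator is shown as an alternative that should give the same pieces without the join and split round trip:

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div class="address">
    <a href="https://website.ca/classifieds/59-barclay-street/">
        59 Some Street<br />City, Zone 1
    </a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
div = soup.find("div", {"class": "address"})

# strip=True trims each text node and skips the whitespace-only ones,
# so only the two address lines remain to be joined and split again.
parts = div.get_text(separator="\n", strip=True).split("\n")

# stripped_strings yields the same trimmed text nodes directly.
same_parts = list(div.stripped_strings)

print(parts)
```

Either list can then be joined with ', ' to build the string passed to the geocoder.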

Upvotes: 2
