Reputation: 4010
This is the source code layout from the website:
<div class="address">
<a href="https://website.ca/classifieds/59-barclay-street/">
59 Some Street<br />City, Zone 1
</a>
</div>
I would like to get the street number, route, and city for Google Geocoding. If I do this:
>>>article.find('div', {'class': 'address'}).text
'59 Some StreetCity, Zone 1'
This drops the <br />, so I'm left with no way to split the route from the city. If I do str().replace('<br />', ', '), I then have to convert the result back to whatever type it was before so that I can call .text to get the actual text inside the <a href>, which is inefficient. I'd like to use the functionality that .text uses to get the actual text, but without the part that removes the <br>. I couldn't find a file called BeautifulSoup.py in my env, so I looked at the BeautifulSoup source code on GitHub, but I can't find a def text in there, and I don't know where else to look.
Update:
articles = page_soup.find('h2', text='Ads').find_next_siblings('article')
for article in articles:
    link = article.find('a')
    br = link.find('br')
    ad_address = br.previous_sibling.strip() + ', ' + br.next_sibling.strip().partition(', Zone ')[0]
    #ad_address = link.br.replace_with(', ').get_text().strip().partition(', Zone ')
Upvotes: 1
Views: 4115
Reputation: 474201
You can locate the br delimiter tag and get the siblings around it:
In [4]: br = soup.select_one("div.address > a > br")
In [5]: br.previous_sibling.strip()
Out[5]: u'59 Some Street'
In [6]: br.next_sibling.strip()
Out[6]: u'City, Zone 1'
You may also locate the br element and replace it with a space using replace_with():
In [4]: a = soup.select_one("div.address > a")
In [5]: a.br.replace_with(" ")
In [6]: a.get_text().strip()
Out[6]: u'59 Some Street City, Zone 1'
Or, you can join all text nodes inside the a tag:
In [7]: a = soup.select_one("div.address > a")
In [8]: " ".join(a.find_all(text=True)).strip()
Out[8]: u'59 Some Street City, Zone 1'
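For the geocoding use case, the sibling approach can be wrapped into a small helper. This is only a sketch: the HTML string is copied from the question, split_address is a hypothetical name, and it assumes every address contains exactly one br with text on both sides and a ', Zone ' suffix after the city.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div class="address">
<a href="https://website.ca/classifieds/59-barclay-street/">
59 Some Street<br />City, Zone 1
</a>
</div>
"""


def split_address(soup):
    """Return (street, city) from the text nodes on either side of the <br>.

    Hypothetical helper: assumes a single <br> inside div.address > a.
    """
    br = soup.select_one("div.address > a > br")
    street = br.previous_sibling.strip()
    # Keep only what comes before ", Zone "; partition returns the whole
    # string as the first item when the marker is absent.
    city = br.next_sibling.strip().partition(", Zone ")[0]
    return street, city


soup = BeautifulSoup(HTML, "html.parser")
street, city = split_address(soup)
print(f"{street}, {city}")
```

Unlike replace_with(), this leaves the parsed tree unmodified, so the same soup can be queried again afterwards.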
Upvotes: 4
Reputation: 590
Try:
soup.find('div', {'class':'address'}).get_text(separator=u"<br/>").split(u'<br/>')
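The separator argument is a plain string that get_text() inserts between adjacent text nodes when it joins them. It is not parsed as HTML, so any string that does not occur in the text itself can serve as the delimiter to split on.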
http://omz-software.com/pythonista/docs/ios/beautifulsoup_ref.html
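A variation on the same idea, sketched against the markup from the question: passing strip=True as well makes get_text() strip each text node and skip the ones that are only whitespace, so the split yields just the two address lines. The newline separator is an arbitrary choice that assumes the stripped pieces contain no newlines themselves.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<div class="address">
<a href="https://website.ca/classifieds/59-barclay-street/">
59 Some Street<br />City, Zone 1
</a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
address_div = soup.find("div", {"class": "address"})

# strip=True trims every text node and drops whitespace-only ones before
# they are joined with the separator.
parts = address_div.get_text(separator="\n", strip=True).split("\n")
print(parts)

# stripped_strings yields the same pieces without a join/split round trip.
same_parts = list(address_div.stripped_strings)
```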
Upvotes: 2