Reputation: 691
I am using BeautifulSoup to extract various elements from a website, and I have run into a case I cannot solve. I want to extract the text of a link, but the link text is split across three lines by <br> tags. For example:
<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456"
<br>
"Everywhere, USA 12345"
</a>
</span>
When I use find_all("span",{"class":"location-address"})[0].text I get something like "123 Main StSuite 456Everywhere, USA 12345", and I would prefer a more natural result with the three lines kept apart.
Upvotes: 0
Views: 67
Reputation: 61225
If you only have one span tag with class=location-address, then simply use the find() method.
>>> from bs4 import BeautifulSoup
>>> html = """<span class="location-address">
... <a href="https://www.google.com/maps" target="_blank">
... "123 Main St"
... <br>
... "Suite 456"
... <br>
... "Everywhere, USA 12345"
... </a>
... </span>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.find('span', class_='location-address').find_next('a').get_text(strip=True).replace('"', '')
'123 Main StSuite 456Everywhere, USA 12345'
But if you have more than one "span" tag with the given class, you can use the find_all() method and do something like this:
>>> for span in soup.find_all('span', class_='location-address'):
... span.find('a').get_text(strip=True).replace('"', '')
...
'123 Main StSuite 456Everywhere, USA 12345'
Or use a css selector:
>>> for a in soup.select('span.location-address > a'):
... a.get_text(strip=True).replace('"', '')
...
'123 Main StSuite 456Everywhere, USA 12345'
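Note that get_text(strip=True) joins the stripped fragments with an empty string, which is why the output above still runs together. One possible way to keep the lines apart is to pass a separator to get_text(). This is a minimal sketch, assuming the quotes are literally present in the markup as in the sample, and using the built-in html.parser instead of lxml so that no extra dependency is needed:

```python
from bs4 import BeautifulSoup

# Sample markup from the question (closing </span> added).
html = """<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456"
<br>
"Everywhere, USA 12345"
</a>
</span>"""

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("span.location-address > a")

# separator is placed between the text fragments; strip=True trims each
# fragment and drops the ones that are only whitespace.
address = link.get_text(separator="\n", strip=True).replace('"', "")
print(address)
```

Use separator=", " or separator=" " instead if a single-line address is preferred.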
Upvotes: 0
Reputation: 2817
You could try find_all("span",{"class":"location-address"})[0].contents instead of find_all("span",{"class":"location-address"})[0].text. It returns the tag's child nodes rather than the flattened text, so the <br /> tags inside the link are still there. You can then replace each <br /> with \n or do whatever you need.
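The idea of replacing <br /> with \n could be sketched as follows. This is only an illustration, not necessarily what the answerer had in mind: it works on the parsed tree with replace_with() rather than on the raw HTML string, and the variable names are my own.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (closing </span> added).
html = """<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456"
<br>
"Everywhere, USA 12345"
</a>
</span>"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("span", class_="location-address").find("a")

# Swap every <br> tag inside the link for a plain newline string.
for br in link.find_all("br"):
    br.replace_with("\n")

# Split on the newlines, trim whitespace and the literal quotes,
# and drop the empty lines left over from the original markup.
lines = [line.strip().strip('"') for line in link.get_text().splitlines()]
address_lines = [line for line in lines if line]
print(address_lines)
```

Keeping the result as a list leaves it up to the caller whether to join with "\n", ", ", or something else.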
Upvotes: 1