Daniel

Reputation: 691

Extract text of element line by line

I am using BeautifulSoup to extract various elements from a website, and I have run into a case I cannot solve. I want to extract the text of a link, but the text is split across three lines by <br> tags. For example:

<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456" 
<br> 
"Everywhere, USA 12345"
</a>

When I use find_all("span",{"class":"location-address"})[0].text, I get something like "123 Main StSuite 456Everywhere, USA 12345". I would prefer output that keeps the three lines separate.

Upvotes: 0

Views: 67

Answers (2)

Sede

Reputation: 61225

If you have only one span tag with class="location-address", simply use the find() method.

>>> from bs4 import BeautifulSoup
>>> html = """<span class="location-address">
... <a href="https://www.google.com/maps" target="_blank">
... "123 Main St"
... <br>
... "Suite 456" 
... <br> 
... "Everywhere, USA 12345"
... </a>"""
>>> soup = BeautifulSoup(html, 'lxml')
>>> soup.find('span', class_='location-address').find_next('a').get_text(strip=True).replace('"', '')
'123 Main StSuite 456Everywhere, USA 12345'

But if you have more than one "span" tag with the given class, using the find_all() method you can do something like this:

>>> for span in soup.find_all('span', class_='location-address'):
...     span.find('a').get_text(strip=True).replace('"', '')
... 
'123 Main StSuite 456Everywhere, USA 12345'

Or use a CSS selector:

>>> for a in soup.select('span.location-address > a'):
...     a.get_text(strip=True).replace('"', '')
... 
'123 Main StSuite 456Everywhere, USA 12345'
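Note that the snippets above still return the three lines run together. A minimal sketch of one way to keep them separate is to iterate over the link's stripped_strings, which yields each text node on its own. It assumes the double quotes are literally part of the page text (as in the example above), adds the closing </span> tag, and uses the built-in html.parser instead of lxml so that nothing beyond bs4 is needed:

```python
from bs4 import BeautifulSoup

# Same markup as in the question, with the closing </span> added.
html = """<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456"
<br>
"Everywhere, USA 12345"
</a>
</span>"""

soup = BeautifulSoup(html, "html.parser")
link = soup.find("span", class_="location-address").find("a")

# stripped_strings yields every text node with surrounding whitespace
# removed and skips nodes that contain only whitespace.
lines = [s.strip('"') for s in link.stripped_strings]

# Join with a newline (or ", ", or any other separator you prefer).
address = "\n".join(lines)
print(address)
```

A similar result should be possible with link.get_text(separator="\n", strip=True) followed by removing the quotes.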

Upvotes: 0

max

Reputation: 2817

You could use find_all("span",{"class":"location-address"})[0].contents instead of find_all("span",{"class":"location-address"})[0].text. It returns a list of the span's child nodes, including the link tag with its text nodes and <br> tags, rather than a single flattened string. You can then replace each <br> with \n, or do whatever else you need.
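As a rough sketch of the "replace <br> with \n" idea, one option is to swap each <br> tag for a newline string in the parsed tree and then read the text. This is only one way to do it; it assumes the quotes are literal characters in the page, adds the closing </span> tag, and uses the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Same markup as in the question, with the closing </span> added.
html = """<span class="location-address">
<a href="https://www.google.com/maps" target="_blank">
"123 Main St"
<br>
"Suite 456"
<br>
"Everywhere, USA 12345"
</a>
</span>"""

soup = BeautifulSoup(html, "html.parser")
span = soup.find_all("span", {"class": "location-address"})[0]

# Replace every <br> tag inside the span with a plain newline string.
for br in span.find_all("br"):
    br.replace_with("\n")

# Split on newlines, drop blank lines, and trim whitespace and quotes.
raw = span.get_text()
parts = [line.strip().strip('"') for line in raw.splitlines() if line.strip()]

# Join however you like; a comma gives a single-line address.
text = ", ".join(parts)
print(text)
```

Note that replace_with() modifies the soup in place, so parse the HTML again if you still need the original <br> tags afterwards.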

Upvotes: 1
