Chris_353

Reputation: 43

BeautifulSoup - how to extract text without opening tag and before <br> tag?

I'm new to Python and BeautifulSoup and have spent quite a few hours trying to figure this out.
I want to extract three pieces of text from a <div> that has no class.
The first is inside an <a> tag, which is itself inside an <h4> tag. I managed to extract this one.
The second immediately follows the closing </h4> tag and is followed by a <br> tag.
The third immediately follows that <br> tag and is also followed by a <br> tag.

Here is the HTML extract I'm working with:

<div>
    <h4 class="actorboxLink">
    <a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
    </h4>
    Route des 4 Vents<br>
    63810 Bagnols<br>
</div>

I want to extract:

Decheterie de Bagnols < That works

Route des 4 Vents < Doesn't work

63810 Bagnols < Doesn't work

Here is the code I have so far:

import urllib
from bs4 import BeautifulSoup    
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")

for a_tag in name:
    print a_tag.text.strip()

I need something like "soup.findAll(all text after </h4>)"

I played with using .next_sibling but I can't get it to work.

Any ideas? Thanks

UPDATE:
I tried this:

for a_tag in classActorboxLink:
    print a_tag.find_all_next(string=True, limit=5) 

which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']

It's a start, but I need to remove all the whitespace and unnecessary characters. I tried using .strip(), .strings and .stripped_strings, but none of them work. Examples:

for a_tag in classActorboxLink.strings

for a_tag in classActorboxLink.stripped_strings

print a_tag.find_all_next(string=True, limit=5).strip() 

For all three I get:

AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'

Upvotes: 4

Views: 3826

Answers (2)

alecxe

Reputation: 474211

Locate the h4 element and use find_next_siblings():

h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
    for text in h4.find_next_siblings(text=True):
        print(text.strip())
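
As a self-contained sketch of the loop above, here it is run against the HTML from the question. The names `html` and `sibling_texts` are illustrative and not part of the original answer. It passes `string=True` (the spelling the question's own `find_all_next` call uses) instead of `text=True`, skips whitespace-only siblings, and replaces the non-breaking spaces (`\xa0`) that show up in the question's output:

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question
html = """
<div>
    <h4 class="actorboxLink">
    <a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
    </h4>
    Route des 4 Vents<br>
    63810 Bagnols<br>
</div>
"""


def sibling_texts(h4):
    """Return the non-empty text nodes that follow `h4` inside its parent."""
    cleaned = []
    for node in h4.find_next_siblings(string=True):
        # Normalise non-breaking spaces, then trim surrounding whitespace
        text = node.replace("\xa0", " ").strip()
        if text:  # drop whitespace-only nodes such as the trailing "\n"
            cleaned.append(text)
    return cleaned


soup = BeautifulSoup(html, "html.parser")
for h4 in soup.find_all("h4", class_="actorboxLink"):
    print(h4.get_text().replace("\xa0", " ").strip())
    for text in sibling_texts(h4):
        print(text)
```

Filtering out empty strings avoids printing a blank line for the whitespace between the last <br> and the closing </div>.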

Upvotes: 2

dstudeba

Reputation: 9048

If you don't need each of the three pieces in separate variables, you can call get_text() on the <div> to get them all in one string. If there are other <div> tags but they all have classes, you can find the <div> tags without a class by passing class_=False. If you can't isolate the <div> you are interested in, this solution won't work for you.

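from urllib.request import urlopen
from bs4 import BeautifulSoup

# `url` is the address of the page to scrape, as in the question
data = urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")

# class_=False matches <div> tags that have no class attribute
for name in soup.find_all("div", class_=False):
    print(name.get_text().strip())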

BTW, this is Python 3 and bs4.
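
If the three pieces are needed separately after all, a variation on the same idea is to read `.stripped_strings` from each class-less <div>. The sketch below is only an illustration: the helper name `classless_div_fields` and the extra `<div class="footer">` are made up for the example, and it selects class-less divs with an explicit `has_attr("class")` check rather than the `class_` filter. Note that `.stripped_strings` is an attribute of a single tag, not of the list-like ResultSet that find_all() returns, which is why the attempts in the question raise AttributeError.

```python
from bs4 import BeautifulSoup

# HTML from the question, plus a hypothetical <div> that does have a class
html = """
<div class="footer">Contact</div>
<div>
    <h4 class="actorboxLink">
    <a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
    </h4>
    Route des 4 Vents<br>
    63810 Bagnols<br>
</div>
"""


def classless_div_fields(soup):
    """Return one list of stripped text pieces per <div> without a class."""
    results = []
    for div in soup.find_all("div"):
        if div.has_attr("class"):
            continue  # skip divs that carry a class attribute
        # stripped_strings yields each text node, trimmed, skipping empty ones
        results.append(list(div.stripped_strings))
    return results


soup = BeautifulSoup(html, "html.parser")
for fields in classless_div_fields(soup):
    print(fields)
```

Each inner list can then be unpacked, for example `name, street, city = fields`, as long as every matching <div> really contains exactly three pieces of text.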

Upvotes: 1
