Reputation: 43
I'm new to python and beautifulsoup and spent quite a few hours trying to figure this one out.
I want to extract three particular text extracts within a <div>
that has no class.
The first text extract I want is within an <a>
tag which is within an <h4>
tag. This I managed to extract it.
The second text extract immediately follows the closing h4 tag </h4>
and is followed by a <br>
tag.
The third text extract immediately follows the <br>
tag after the second text extract and is also followed by a <br>
tag.
Here the html extract I work with:
<div>
<h4 class="actorboxLink">
<a href="/a-decheterie-de-bagnols-2689">Decheterie de Bagnols</a>
</h4>
Route des 4 Vents<br>
63810 Bagnols<br>
</div>
I want to extract:
Decheterie de Bagnols < That works
Route des 4 Vents < Doesn't work
63810 Bagnols < Doesn't work
Here is the code I have so far:
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
name = soup.findAll("h4", class_="actorboxLink")
for a_tag in name:
print a_tag.text.strip()
I need something like "soup.findAll(all text after </h4>
)"
I played with using .next_sibling but I can't get it to work.
Any ideas? Thanks
UPDATE:
I tried this:
for a_tag in classActorboxLink:
print a_tag.find_all_next(string=True, limit=5)
which gives me:
[u'\n', u'\r\n\t\t\t\t\t\tDecheterie\xa0de\xa0Bagnols\t\t\t\t\t', u'\n', u'\r\n\t\t\t\tRoute\xa0des\xa04\xa0Vents', u'\r\n\t\t\t\t63810 Bagnols']
It's a start but I need to relove all the whitespaces and unecessary characters. I tried using .strip()
,.strings
and .stripped_strings
but it doesn't work. Examples:
for a_tag in classActorboxLink.strings
for a_tag in classActorboxLink.stripped_strings
print a_tag.find_all_next(string=True, limit=5).strip()
For all three I get:
AttributeError: 'ResultSet' object has no attribute 'strings/stripped_strings/strip'
Upvotes: 4
Views: 3826
Reputation: 474211
Locate the h4
element and use find_next_siblings()
:
h4s = soup.find_all("h4", class_="actorboxLink")
for h4 in h4s:
for text in h4.find_next_siblings(text=True):
print(text.strip())
Upvotes: 2
Reputation: 9048
If you don't need each of the 3 elements you are looking for in different variables you could just use the get_text()
function on the <div>
to get them all in one string. If there are other div
tags but they all have classes you can find all the <div>
with class=false
. If you can't isolate the <div>
that you are interested in then this solution won't work for you.
import urllib
from bs4 import BeautifulSoup
data = urllib.urlopen(url).read()
soup = BeautifulSoup(data, "html.parser")
for name in soup.find_all("div", class=false)
print name.get_text().strip()
BTW this is python 3 & bs4
Upvotes: 1