Reputation: 373
I've been working on crawling with BeautifulSoup, but I've run into some messy HTML tags.
Here is an example:
<html>
<body>
<p>Hey</p>
<div>
<div>
<span class="date">0817</span>
</div>
</div>
<p>I want all of those</p>
<div>
<div>
<p>But I want to get those separately</p>
<div>
</div>
<p>Hope this work</p>
</body>
</html>
So if I use code like this:
from bs4 import BeautifulSoup

# html holds the markup shown above
soup = BeautifulSoup(html, 'html.parser')
body = soup.find("body")
print(body.text)
I get something like this (ignoring the newlines):
"Hey0817I want all of thoseBut I want to get those separatelyHope this work"
The question is: can I get this text with a separator string inserted between the pieces that come from different tags? For example:
"@@@Hey@@@0817@@@I want all of those@@@But I want to get those separately@@@Hope this work"
or
"Hey@@@0817@@@I want all of those@@@But I want to get those separately@@@Hope this work@@@"
or
"Hey@@@0817@@@I want all of those@@@But I want to get those separately@@@Hope this work"
That way I could split the text on "@@@" later in other code. Or is there a workaround that does something similar? Any advice would be greatly appreciated. Thanks for your time and interest!
Upvotes: 0
Views: 31
Reputation: 2603
If you want a list, you can use:
item_text = [t.text for t in body.find_all()]
If you really want the separators:
body.get_text('@@@')
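One thing to watch for: find_all() with no arguments returns every descendant tag, so text inside nested tags (such as the span wrapped in two divs) can appear more than once in that list. Here is a minimal sketch of an alternative, assuming the markup from the question is in a string named html (abbreviated here). It uses stripped_strings, which yields each text node once with the surrounding whitespace removed.

```python
from bs4 import BeautifulSoup

# Abbreviated version of the markup from the question (assumed to be in `html`).
html = """<html>
<body>
<p>Hey</p>
<div>
<div>
<span class="date">0817</span>
</div>
</div>
<p>I want all of those</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# find_all() visits nested tags too, so "0817" is reported by both divs and the span.
per_tag = [t.text.strip() for t in body.find_all()]

# stripped_strings walks the text nodes instead, so each piece appears only once.
item_text = list(body.stripped_strings)
print(item_text)
```

If the duplicates from nested tags are not a problem for your page, the shorter list comprehension is fine as it is.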
Upvotes: 1
Reputation: 15480
I would use .get_text:
soup.body.get_text('@@@')
Adding .strip() removes the leading and trailing whitespace:
soup.body.get_text('@@@').strip()
Use print to see the newlines rendered instead of escaped:
print(soup.body.get_text('@@@').strip())
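One caveat: without strip=True, get_text also joins the whitespace-only strings that sit between tags, so the result has separators around bare newlines. Below is a small sketch, assuming a shortened version of the question's markup. Passing strip=True strips each piece and drops the empty ones, and split turns the result back into a list. The name SEP is just an illustrative variable.

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question.
html = """<body>
<p>Hey</p>
<div><div>
<span class="date">0817</span>
</div></div>
<p>Hope this work</p>
</body>"""

soup = BeautifulSoup(html, "html.parser")

SEP = "@@@"  # arbitrary marker; pick one that cannot occur in the page text

# strip=True trims every text node and skips the ones that are only whitespace.
joined = soup.body.get_text(SEP, strip=True)
print(joined)

# Splitting on the same marker recovers the individual pieces.
parts = joined.split(SEP)
print(parts)
```

If all you need in the end is the list, you can skip the separator entirely and iterate over soup.body.stripped_strings.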
Upvotes: 1