Jeong In Kim

Reputation: 373

With BeautifulSoup, can I get the text with a separator string inserted between tags?

I've been working on crawling with BeautifulSoup, but I've run into some messy HTML tags.

Here is an example:

<html>
    <body>
        <p>Hey</p>
        <div>
            <div>
                <span class="date">0817</span>
            </div>
        </div>
        <p>I want all of those</p>
        <div>
            <div>
                <p>But I want to get those separately</p>
            </div>
        </div>
        <p>Hope this work</p>
    </body>
</html>

So if I use code like this:

from bs4 import BeautifulSoup

# html is a string holding the markup shown above
soup = BeautifulSoup(html, 'html.parser')
body = soup.find("body")
print(body.text)

I'll probably get this:

"Hey0817I want all of thoseBut I want to get those separatelyHope this work"

The question is: can I get those texts joined with some string as a separator that marks the boundaries between tags? For example:

"@@@Hey@@@0817@@@I want all of those@@@But I want to get those separately@@@Hope this work"
or
"Hey@@@0817@@@I want all of those@@@But I want to get those separately@@@Hope this work@@@"
or
"Hey@@@0817@@@I want all of those@@@But I want to get those separately@@@Hope this work"

That way I can split the text on "@@@" later in other code. Or is there a workaround that does something similar? Any advice would be very helpful. Thanks for your time and interest!

Upvotes: 0

Views: 31

Answers (2)

Louic

Reputation: 2603

If you want a list, you can use:

item_text = [t.text for t in body.find_all()]

If you really want the separators:

body.get_text('@@@')
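
One caveat on the list version: find_all() with no arguments returns every descendant tag, including the wrapper divs, so text inside nested tags shows up more than once. A minimal sketch of an alternative, assuming the HTML from the question (with the inner div closed), uses the stripped_strings generator, which yields each text node once and skips whitespace-only nodes:

```python
from bs4 import BeautifulSoup

html = """<html>
    <body>
        <p>Hey</p>
        <div>
            <div>
                <span class="date">0817</span>
            </div>
        </div>
        <p>I want all of those</p>
        <div>
            <div>
                <p>But I want to get those separately</p>
            </div>
        </div>
        <p>Hope this work</p>
    </body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# One entry per text node, trimmed, with whitespace-only nodes skipped.
item_text = list(body.stripped_strings)
print(item_text)
# ['Hey', '0817', 'I want all of those', 'But I want to get those separately', 'Hope this work']

# For comparison: one entry per tag, so nested text repeats.
per_tag = [t.text for t in body.find_all()]
print(len(per_tag))  # 9
```

If a list is the end goal, this also avoids joining on "@@@" and splitting again afterwards.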

Upvotes: 1

wasif

Reputation: 15480

I would use .get_text with a separator:

soup.body.get_text('@@@')

Adding .strip() to trim the leading and trailing whitespace is better:

soup.body.get_text('@@@').strip()

Printing the result shows the newlines rendered instead of as escaped \n characters:

print(soup.body.get_text('@@@').strip())
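
Note that .strip() on the joined string only trims its two ends; the whitespace between tags is still treated as text and gets joined with the separator. get_text also accepts strip=True, which trims each piece and drops the empty ones before joining. A small sketch on a shortened version of the question's HTML (the shortened markup is only for illustration):

```python
from bs4 import BeautifulSoup

html = """<body>
    <p>Hey</p>
    <div>
        <div><span class="date">0817</span></div>
    </div>
    <p>Hope this work</p>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims every text node and skips whitespace-only ones.
joined = soup.body.get_text("@@@", strip=True)
print(joined)  # Hey@@@0817@@@Hope this work

# Splitting on the separator recovers the individual texts.
parts = joined.split("@@@")
print(parts)   # ['Hey', '0817', 'Hope this work']

# Without strip=True, indentation and newlines remain between separators.
raw = soup.body.get_text("@@@").strip()
print(repr(raw))
```

This matches the third format asked for in the question, with no leading or trailing separator.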

Upvotes: 1
