Karthi1234

Reputation: 1017

BeautifulSoup: extract text that is not wrapped in a tag

I have parsed HTML like the sample below and am trying to extract the text pieces in document order.

<b>
 <i>
  Data
 </i>
 Data Summary
</b>
<br/>
Data Description
<br/>
<br/>
<pre>Data paragraph which contains huge string<br/></pre>
<br/>
<br/>
<pre></pre>
<br/>
<br/>
<b>
 <i>
  Data 2
 </i>
 Data 2 Summary
</b>
<br/>
Data 2 Description
<br/>
<br/>
<pre>Data 2 paragraph which contains huge string<br/></pre>
<br/>
<br/>

I am able to extract the text inside the i and b tags using soup.findAll(['b', 'i']), but I am struggling to get the untagged text that comes after each b tag. I have tried next_sibling, but that does not work here either. Any help would be appreciated.

The expected result is:

Data Summary : Data Description : Data paragraph which contains huge string newline Data 2 : Data 2 Summary : Data 2 Description : Data 2 paragraph which contains huge string

Upvotes: 1

Views: 827

Answers (1)

Martin Evans

Reputation: 46759

You could iterate over all of the elements that contain text as follows:

from bs4 import BeautifulSoup

html = """
<b><i>Data</i>Data Summary</b><br/>
Data Description<br/>
<br/>
<pre>Data paragraph which contains huge string<br/></pre>
<br/>
<br/>
<pre></pre>
<br/>
<br/>

<b><i>Data 2</i>Data 2 Summary</b><br/>
Data 2 Description<br/>
<br/>
<pre>Data 2 paragraph which contains huge string<br/></pre>
<br/>
<br/>"""

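soup = BeautifulSoup(html, "html.parser")
# string=True matches every text node, tagged or not, in document order
# (older bs4 releases spell this argument text=True).
text_items = [t.strip() for t in soup.find_all(string=True) if t.strip()]
print(text_items)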

This strips surrounding whitespace from each text node and keeps only the non-empty results. It displays the following list:

['Data', 'Data Summary', 'Data Description', 'Data paragraph which contains huge string', 'Data 2', 'Data 2 Summary', 'Data 2 Description', 'Data 2 paragraph which contains huge string']    
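
To get from that flat list to one line per record, as in the question's expected result, the items can be grouped by where each b tag starts. The following is only a sketch under some assumptions: extract_records is a hypothetical helper name, every record is assumed to begin with a b tag, and the sample markup is a shortened copy of the question's. It keeps every text item, including the leading "Data" label.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
sample_html = """
<b><i>Data</i>Data Summary</b><br/>
Data Description<br/>
<pre>Data paragraph which contains huge string<br/></pre>
<pre></pre>
<b><i>Data 2</i>Data 2 Summary</b><br/>
Data 2 Description<br/>
<pre>Data 2 paragraph which contains huge string<br/></pre>
"""


def extract_records(html):
    """Hypothetical helper: group text nodes into records that start at each <b>."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    prev_in_b = False
    for node in soup.find_all(string=True):
        text = node.strip()
        if not text:
            continue  # skip whitespace-only nodes between tags
        in_b = node.find_parent("b") is not None
        # Assumption: entering a <b> from outside one marks a new record.
        if (in_b and not prev_in_b) or not records:
            records.append([])
        records[-1].append(text)
        prev_in_b = in_b
    return [" : ".join(items) for items in records]


print("\n".join(extract_records(sample_html)))
```

With the sample above this prints two lines, one per record, with the items separated by " : ". If a record in the real data does not start with a b tag, the grouping condition would need to be adjusted.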

Upvotes: 1
