Nik P

Reputation: 89

Closing <p> tags in badly formatted HTML with BeautifulSoup 4

I have a set of not-quite-valid HTML pages to scrape. The data I need is in <p> tags, but most of them aren't closed:

<p>Bla-bla-bla
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script>
<p>here's some more </p>

So when I run a search, I get a messy ResultSet in which each paragraph also contains everything that follows it:

In [2]: html='''
<p>Bla-bla-bla
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script>
<p>here's some more </p>'''

In [3]: from bs4 import BeautifulSoup

In [4]: soup = BeautifulSoup(html, "html.parser")

In [5]: p = soup.find_all('p')

In [6]: len(p)
Out[6]: 5

In [7]: p[0]
Out[7]: 
<p>Bla-bla-bla
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script></script></p></p></p></p></p>

In [8]: p[1]
Out[8]: 
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script></script></p></p></p></p>

In [9]: p[2]
Out[9]: 
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script></script></p></p></p>

I guess the default 'html.parser' simply closes all open tags at the end of the input string, whatever they are. I would like the parser to be less greedy, so that I end up with a list of separate paragraphs. Is there an obvious solution, or should I work with this accumulated set and clean it up, for instance by successively subtracting strings?

(Also, the soup loses the last <p>, the only one that is formatted correctly, which is pretty weird.)

Upvotes: 0

Views: 579

Answers (3)

nil

Reputation: 63

From bs4 docs:

Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does.

So:

pip install html5lib

And then

In [14]: soup = BeautifulSoup(html, "html5lib")

In [15]: p = soup.find_all('p')

In [17]: p[0]
Out[17]: <p>Bla-bla-bla\n</p>

The last paragraph is still lost, however:

In [18]: len(p)
Out[18]: 5

In [19]: p
Out[19]: 
[<p>Bla-bla-bla\n</p>,
 <p>bla bla\n</p>,
 <p>more bla-bla\n</p>,
 <p><span class="some_class">another bla</span>\n</p>,
 <p>just some more bla bla bla\n</p>]
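The missing paragraph is most likely down to the unclosed <script> tag: in HTML, everything after <script> up to a matching </script> is raw script text, so the final <p> never becomes an element. Here is a minimal sketch using only the standard library's html.parser.HTMLParser. The names PTagCounter and count_p_tags are made up for illustration, and patching in </script> assumes the script element was meant to be empty:

```python
from html.parser import HTMLParser


class PTagCounter(HTMLParser):
    """Counts the <p> start tags the tokenizer actually reports."""

    def __init__(self):
        super().__init__()
        self.p_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.p_count += 1


def count_p_tags(markup):
    # Hypothetical helper: feed the markup and return the number of <p> tags seen.
    parser = PTagCounter()
    parser.feed(markup)
    parser.close()
    return parser.p_count


html = '''
<p>Bla-bla-bla
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script>
<p>here's some more </p>'''

# The last <p> sits inside the unclosed <script>, so it is never reported as a tag.
print(count_p_tags(html))

# Assumption: the <script> was meant to be empty, so close it right away.
patched = html.replace("<script>", "<script></script>")
print(count_p_tags(patched))
```

If that assumption holds for your pages, passing the patched string to BeautifulSoup(patched, "html5lib") should give the last paragraph back as well.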

Upvotes: 2

xystum

Reputation: 1009

If every p tag has its own line, you could strip whitespace from the input text (to prevent a blank line at the end) and then try:

Search: (?<!(div|script|p)>)$

Replace: </p>

That will add a closing p tag at the end of every line that doesn't end with an opening or closing div, script, or p tag. To exclude further tags (like table etc.), add them in the same manner:

(?<!(div|script|p|table|tr|td|th|section)>)$

etc.
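If you want to run this search and replace from Python, note that the re module rejects a lookbehind whose alternatives differ in length, so the pattern above cannot be used verbatim. A rough sketch that uses one fixed-width lookbehind per tag instead (the function name close_p_per_line is made up for illustration, and it still assumes one paragraph per line):

```python
import re

# One fixed-width negative lookbehind per tag name.
# re.MULTILINE makes $ match at the end of every line, not just the end of the string.
LINE_END = re.compile(r"(?<!div>)(?<!script>)(?<!p>)$", re.MULTILINE)


def close_p_per_line(markup):
    # Strip first so that no empty line is left at the start or end of the input.
    return LINE_END.sub("</p>", markup.strip())


html = '''
<p>Bla-bla-bla
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script>
<p>here's some more </p>'''

result = close_p_per_line(html)
print(result)
```

Blank lines inside the input would also get a stray </p>, so remove them beforehand if your pages contain any.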

Upvotes: 0

Yupsiree

Reputation: 163

Have you tried:

html = html.replace("<p>", "</p><p>")

And then:

html = html.replace("</p><p>", "<p>", 1)

to clean up the first tag. (str.replace returns a new string, so the result has to be assigned back.)
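Put together, the two replacements could look like the sketch below. The function name close_paragraphs is made up for illustration, and appending a final </p> is an extra step that assumes the input ends with an unclosed paragraph:

```python
def close_paragraphs(markup):
    # Put a closing tag in front of every opening <p> ...
    closed = markup.replace("<p>", "</p><p>")
    # ... then drop the stray </p> that now precedes the very first paragraph.
    closed = closed.replace("</p><p>", "<p>", 1)
    # Assumption: the last paragraph is unclosed as well, so close it explicitly.
    return closed + "</p>"


sample = "<p>one\n<p>two\n<p>three\n"
result = close_paragraphs(sample)
print(result)
```

Each closing tag lands right before the next <p>, so anything sitting between two paragraphs (the div and script in the question) stays inside the preceding paragraph, and a paragraph that is already closed ends up with a second </p>.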

Upvotes: 0
