Reputation: 89
I've got a set of not-quite-valid HTML pages to scrape. The data I need is in "p" tags, but most of them aren't closed:
<p>Bla-bla-bla
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script>
<p>here's some more </p>
So when I perform a search, I get a messy ResultSet of accumulated data:
In [2]: html='''
<p>Bla-bla-bla
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script>
<p>here's some more </p>'''
In [3]: from bs4 import BeautifulSoup
In [4]: soup = BeautifulSoup(html, "html.parser")
In [5]: p = soup.find_all('p')
In [6]: len(p)
Out[6]: 5
In [7]: p[0]
Out[7]:
<p>Bla-bla-bla
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script></script></p></p></p></p></p>
In [8]: p[1]
Out[8]:
<p>bla bla
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script></script></p></p></p></p>
In [9]: p[2]
Out[9]:
<p>more bla-bla
<p><span class="some_class">another bla</span>
<p>just some more bla bla bla
<div class="another_class"></div>
<script></script></p></p></p>
I guess the default 'html.parser' just closes all open tags at the end of the input string, no matter what they are. In my case I'd like the parser to be less greedy, so that I end up with a list of separate paragraphs. Is there an obvious solution, or should I work with this accumulated set and clean it up by, for instance, subsequent subtracting of strings or something?
(Also, the soup loses the last "p", the only one that is closed correctly, which is pretty weird.)
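For example, the kind of "subtracting" I have in mind is roughly this (just a sketch; it relies on each accumulated "p" ending with the full text of the next one, as in the output above):
# p is the accumulated ResultSet from above; each paragraph's text ends with
# the next paragraph's text, so cutting off that suffix leaves only the text
# that belongs to the paragraph itself.
texts = [tag.get_text() for tag in p] + [""]   # empty sentinel for the last one
own = [full[:len(full) - len(nxt)].strip()
       for full, nxt in zip(texts, texts[1:])]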
Upvotes: 0
Views: 579
Reputation: 63
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does.
So:
pip install html5lib
And then
In [14]: soup = BeautifulSoup(html, "html5lib")
In [15]: p = soup.find_all('p')
In [17]: p[0]
Out[17]: <p>Bla-bla-bla\n</p>
The last paragraph is still lost, however:
In [18]: len(p)
Out[18]: 5
In [19]: p
Out[19]:
[<p>Bla-bla-bla\n</p>,
<p>bla bla\n</p>,
<p>more bla-bla\n</p>,
<p><span class="some_class">another bla</span>\n</p>,
<p>just some more bla bla bla\n</p>]
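I suspect the reason is that everything after the stray <script> is treated as script text until a </script> appears, so the last paragraph ends up inside the script element. For markup like the above, where the script elements are empty, one workaround is to close them before parsing; a rough sketch, with html being the string from the question:
import re
from bs4 import BeautifulSoup
# Close any <script> that isn't immediately followed by </script>, so the
# parser doesn't swallow the rest of the document as script text.
cleaned = re.sub(r"<script>(?!\s*</script>)", "<script></script>", html)
soup = BeautifulSoup(cleaned, "html5lib")
print(len(soup.find_all("p")))   # the last paragraph should be found now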
Upvotes: 2
Reputation: 1009
If every p tag has its own line, you could strip trailing whitespace from the input text (to prevent a blank line at the end) and then try:
Search: (?<!(div|script|p)>)$
Replace: </p>
That will add a closing p tag to every line end, if the line doesn't end with an opening or closing div, script, or p tag. To exclude further tags (like table etc.), add them in the same manner:
(?<!(div|script|p|table|tr|td|th|section)>)$
etc.
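If you want to do the same replacement from Python rather than in an editor, note that the re module doesn't accept variable-length lookbehinds, so the alternation has to be split into separate fixed-length ones. A sketch, with html being the string from the question:
import re
from bs4 import BeautifulSoup
# Append </p> to every line that doesn't already end with a div, script,
# or p tag; whitespace is stripped first, as described above.
fixed = re.sub(r"(?<!div>)(?<!script>)(?<!p>)$", "</p>",
               html.strip(), flags=re.MULTILINE)
soup = BeautifulSoup(fixed, "html.parser")
print(soup.find_all("p"))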
Upvotes: 0
Reputation: 163
Have you tried:
html = html.replace("<p>", "</p><p>")
And then:
html = html.replace("</p><p>", "<p>", 1)
to clean up the first tag? (str.replace returns a new string, so the result has to be assigned back.)
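The cleaned-up string can then go back into BeautifulSoup as usual, for example:
from bs4 import BeautifulSoup
# html here is the question's markup after the two replace calls above.
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("p"):
    print(tag.get_text(strip=True))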
Upvotes: 0