Reputation: 4115
I am attempting to split up some HTML based on a certain pattern.
A specific part of the HTML has to be divided into one or more parts (text arrays). I can tell where each part begins and ends by looking for the first <strong> and the next double <br />. All the text between these two markers has to be put into a list that I can iterate over.
How can this easily be solved?
So I want the following HTML:
<div class="clearfix">
<!--# of ppl associated with place-->
This is some kind of buzzword:<br />
<br />
<!--Persontype-->
<strong>Jimbo</strong> Jack <br />
Some filler text <br />
More weird stuff
<br />
Unstructured text <br />
<br />
<strong>Jacky</strong> Bradson <br />
This is just a test <br />
Nothing but a test
<br />
More unstructured stuff <br />
<br />
<strong>Junior</strong> Bossman <br />
This is fluffy
<br />
As I would expect <br />
<br />
</div>
to be split into the following portions.
First part:
<strong>Jimbo</strong> Jack <br />
Some filler text <br />
More weird stuff
<br />
Unstructured text <br />
<br />
Second part:
<strong>Jacky</strong> Bradson <br />
This is just a test <br />
Nothing but a test
<br />
More unstructured stuff <br />
<br />
Third part:
<strong>Junior</strong> Bossman <br />
This is fluffy
<br />
As I would expect <br />
<br />
</div>
Upvotes: 2
Views: 15330
Reputation: 5814
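A basic solution uses prettify and split. The idea is to convert the parsed HTML back into text and cut out the portion of interest.
from bs4 import BeautifulSoup
# `text` is the HTML string shown below
soup = BeautifulSoup(text, 'html.parser')
# Keep everything after the <!--Persontype--> marker, then split on each <strong>
for part in soup.prettify().split('<!--Persontype-->')[1].split('<strong>'):
    if part.strip():  # skip the empty chunk before the first <strong>
        print('<strong>' + part)
The input text is: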
text= '''
<div class="clearfix">
<!--# of ppl associated with place-->
This is some kind of buzzword:<br />
<br />
<!--Persontype-->
<strong>Jimbo</strong> Jack <br />
Some filler text <br />
More wierd stuff
<br />
Unstructured text <br />
<br />
<strong>Jacky</strong> Bradson <br />
This is just a test <br />
Nothing but a test
<br />
More unstructured stuff <br />
<br />
<strong>Junior</strong> Bossman <br />
This is fluffy
<br />
As i would expect <br />
<br />
</div>
'''
Output is:
Jimbo Jack
Some filler text
More wierd stuff
Unstructured text
Jacky Bradson
This is just a test
Nothing but a test
More unstructured stuff
Junior Bossman
This is fluffy
As i would expect
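The rule from the question (a part starts at a <strong> and ends at the next double <br />) can also be written as a single regular expression. The following is only a minimal sketch using the standard-library re module instead of BeautifulSoup; the names PART_RE and split_parts are made up for illustration, and it assumes the <strong> tags carry no attributes and the markup is as regular as in the sample. Regular expressions are fragile on arbitrary HTML, so treat this as a quick fix for this particular layout.

```python
import re

html = '''
<div class="clearfix">
<!--# of ppl associated with place-->
This is some kind of buzzword:<br />
<br />
<!--Persontype-->
<strong>Jimbo</strong> Jack <br />
Some filler text <br />
More weird stuff
<br />
Unstructured text <br />
<br />
<strong>Jacky</strong> Bradson <br />
This is just a test <br />
Nothing but a test
<br />
More unstructured stuff <br />
<br />
<strong>Junior</strong> Bossman <br />
This is fluffy
<br />
As I would expect <br />
<br />
</div>
'''

# One part: a <strong> tag, then as little text as possible (non-greedy),
# up to and including the next two <br /> tags separated only by whitespace.
PART_RE = re.compile(r'<strong>.*?<br\s*/?>\s*<br\s*/?>', re.DOTALL)


def split_parts(markup):
    """Return each '<strong> ... <br /><br />' chunk as a list of strings."""
    return PART_RE.findall(markup)


parts = split_parts(html)
for part in parts:
    print(part)
    print('---')  # visual separator between parts
```

Each element of parts still contains the original tags, so it can be passed to BeautifulSoup afterwards if only the text is needed.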
Upvotes: 5