JavaCake

Reputation: 4115

Splitting up text with BeautifulSoup by certain HTML structures

I am attempting to split up some HTML based on a certain pattern.

A specific part of the HTML has to be divided into one or more parts, each a list of text lines. A part starts at a <strong> tag and ends at a double <br />. All the text between those two markers has to be put into a list that I can iterate over.

How can this be solved easily?

I want the following HTML:

<div class="clearfix">
    <!--# of ppl associated with place-->
        This is some kind of buzzword:<br />
    <br />
    <!--Persontype-->
        <strong>Jimbo</strong> Jack            <br />
Some filler text            <br />
More weird stuff
            <br />
Unstructured text        <br />
        <br />
        <strong>Jacky</strong> Bradson            <br />
This is just a test            <br />
Nothing but a test
            <br />
More unstructured stuff        <br />
        <br />
        <strong>Junior</strong> Bossman            <br />
This is fluffy
            <br />
As I would expect        <br />
        <br />
</div>

to be split into the following portions.

First part:

        <strong>Jimbo</strong> Jack            <br />
Some filler text            <br />
More weird stuff
            <br />
Unstructured text        <br />
        <br />

Second part:

        <strong>Jacky</strong> Bradson            <br />
This is just a test            <br />
Nothing but a test
            <br />
More unstructured stuff        <br />
        <br />

Third part:

        <strong>Junior</strong> Bossman            <br />
This is fluffy
            <br />
As I would expect        <br />
        <br />
</div>

Upvotes: 2

Views: 15330

Answers (1)

aberna

Reputation: 5814

A basic solution uses prettify and split. The idea is to convert the parsed HTML back into text and then cut out the portions of interest.

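from bs4 import BeautifulSoup

# `text` is the HTML string shown below
soup = BeautifulSoup(text, 'html.parser')
# Keep everything after the Persontype comment, then cut at each <strong>
for chunk in soup.prettify().split('<!--Persontype-->')[1].split('<strong>'):
    print('<strong>' + chunk)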

The input text is:

text= '''
<div class="clearfix">
    <!--# of ppl associated with place-->
        This is some kind of buzzword:<br />
    <br />
    <!--Persontype-->
        <strong>Jimbo</strong> Jack            <br />
Some filler text            <br />
More wierd stuff
            <br />
Unstructured text        <br />
        <br />
        <strong>Jacky</strong> Bradson            <br />
This is just a test            <br />
Nothing but a test
            <br />
More unstructured stuff        <br />
        <br />
        <strong>Junior</strong> Bossman            <br />
This is fluffy
            <br />
As i would expect        <br />
        <br />
</div>
'''

Output (shown here without the HTML tags):

Jimbo Jack
Some filler text
More wierd stuff
Unstructured text

Jacky Bradson
This is just a test
Nothing but a test
More unstructured stuff


Junior Bossman
This is fluffy
As i would expect
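
Since the question asks for each part as a list, here is a rough sketch of an alternative that walks the parse tree instead of splitting the prettified string. It is not part of the snippet above: the helper name split_by_strong and the shortened SAMPLE_HTML are made up for illustration, and it assumes every part starts at a <strong> tag whose following text and <br /> tags are its direct siblings.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Shortened, hypothetical sample with the same structure as the question's HTML
SAMPLE_HTML = """<div class="clearfix">
    <!--Persontype-->
        <strong>Jimbo</strong> Jack            <br />
Some filler text            <br />
        <br />
        <strong>Jacky</strong> Bradson            <br />
This is just a test            <br />
        <br />
</div>"""


def split_by_strong(html):
    """Return one list of text lines for each <strong> tag in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []
    for strong in soup.find_all("strong"):
        lines = []
        current = strong.get_text()  # the bold name starts the first line
        for node in strong.next_siblings:
            if isinstance(node, Tag):
                if node.name == "strong":
                    break  # the next part begins here
                if node.name == "br":
                    lines.append(current)  # <br /> ends the current line
                    current = ""
            elif isinstance(node, NavigableString) and not isinstance(node, Comment):
                current += str(node)  # plain text, HTML comments skipped
        lines.append(current)
        # Collapse runs of whitespace and drop the empty lines left by <br /><br />
        cleaned = [" ".join(line.split()) for line in lines]
        parts.append([line for line in cleaned if line])
    return parts


parts = split_by_strong(SAMPLE_HTML)
for part in parts:
    print(part)
```

Each element of parts is a plain list of strings, so it can be iterated over directly. If the <strong> tags were nested inside other elements, the sibling walk would need adjusting.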

Upvotes: 5
