Acorn

Reputation: 50497

Separating HTML into groups using BeautifulSoup when groups are all in the same element

Here's an example:

<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>

If each animal were in a separate element, I could just iterate over the elements. That would be great. But the website I'm trying to parse has all the information in one element.

What would be the best way of either separating the soup into different animals, or otherwise extracting the attributes along with the animal each one belongs to?

(feel free to recommend a better title)

Upvotes: 1

Views: 359

Answers (2)

John La Rooy

Reputation: 304137

If you don't need to keep the animal names in order, you can simplify Jamie's answer like this:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
""")

attributes = {}

for p in soup.findAll('p'):
    if p['class'] == 'animal':
        # Start a new group; the attributes that follow belong to this animal.
        animal = p.string
        attributes[animal] = []
    elif p['class'] == 'attribute':
        attributes[animal].append(p.string)

print attributes.keys()
print attributes
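The snippet above targets Python 2 and BeautifulSoup 3. As a rough sketch of the same single-pass grouping for Python 3 with the `bs4` package (assuming it is installed): in `bs4` the `class` attribute comes back as a list, so the comparison becomes a membership test. The names `HTML` and `group_attributes` are made up for this example. On Python 3.7+ a plain dict also keeps insertion order, so the animals come out in document order here.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
"""


def group_attributes(html):
    """Map each animal name to the attribute strings that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    attributes = {}
    animal = None  # most recently seen animal, if any
    for p in soup.find_all("p"):
        classes = p.get("class") or []  # bs4 returns class as a list
        text = p.get_text(strip=True)
        if "animal" in classes:
            animal = text
            attributes[animal] = []
        elif "attribute" in classes and animal is not None:
            # Skip attributes that appear before any animal (an assumption).
            attributes[animal].append(text)
    return attributes


print(group_attributes(HTML))
```

Ignoring attributes that show up before the first animal is a choice made for this sketch; raising an error there may suit your data better.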

Upvotes: 2

Jamie Wong

Reputation: 18350

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
""")

animals = []
attributes = {}

for p in soup.findAll('p'):
    if p['class'] == 'animal':
        animals.append(p.string)
    elif p['class'] == 'attribute':
        # Attach the attribute to the most recently seen animal.
        attributes.setdefault(animals[-1], []).append(p.string)

print animals
print attributes

That should work.
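A different way to get at the same grouping, sketched here for Python 3 with the `bs4` package (assuming it is installed), is to start from each animal tag and walk its following siblings until the next animal. This returns an ordered list of pairs rather than a dict, so repeated animal names would not overwrite each other. `SAMPLE` and `animals_with_attributes` are names invented for this example, and it assumes all the `<p>` tags are siblings, as in the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
SAMPLE = """
<p class='animal'>cats</p>
<p class='attribute'>they meow</p>
<p class='attribute'>they have fur</p>
<p class='animal'>turtles</p>
<p class='attribute'>they don't make noises</p>
<p class='attribute'>they have shells</p>
"""


def animals_with_attributes(html):
    """Return (animal, [attributes]) pairs in document order."""
    soup = BeautifulSoup(html, "html.parser")
    groups = []
    for animal in soup.find_all("p", class_="animal"):
        attrs = []
        # Walk the <p> siblings after this animal, stopping at the next one.
        for sib in animal.find_next_siblings("p"):
            classes = sib.get("class") or []
            if "animal" in classes:
                break
            if "attribute" in classes:
                attrs.append(sib.get_text(strip=True))
        groups.append((animal.get_text(strip=True), attrs))
    return groups


for name, attrs in animals_with_attributes(SAMPLE):
    print(name, attrs)
```

This rescans siblings for every animal, so the single loop above is likely the better fit for very long pages.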

Upvotes: 2
