Mika Schiller

Reputation: 425

BeautifulSoup: pulling a tag preceding another tag

I'm pulling lists from webpages, and to give them context I'm also pulling the text immediately preceding them. Pulling the tag that precedes the <ul> or <ol> tag seems to be the best way. So let's say I have this list:

[image: the bulleted list, not preserved]

I'd want to pull the bullet and the word "Millennials". I use a BeautifulSoup function:

# pull <ul> tags that have an <li>, no attributes, and no links
def pull_ul(tag):
    return tag.name == 'ul' and tag.li and not tag.attrs and not tag.li.attrs and not tag.a

ul_tags = webpage.find_all(pull_ul)
# find the node immediately preceding each <ul> tag and prepend it to the <ul> tag
ul_with_context = [str(ul.previous_sibling) + str(ul) for ul in ul_tags]

When I print ul_with_context, I get the following:

['\n<ul>\n<li>With immigration adding more numbers to its group than any other, the Millennial population is projected to peak in 2036 at 81.1 million. Thereafter the oldest Millennial will be at least 56 years of age and mortality is projected to outweigh net immigration. By 2050 there will be a projected 79.2 million Millennials.</li>\n</ul>']

As you can see, "Millennials" wasn't pulled. The page I'm pulling from is http://www.pewresearch.org/fact-tank/2016/04/25/millennials-overtake-baby-boomers/. Here's the section of code for the bullet:

[image: the page's HTML for the bullet, not preserved]

The <p> and <ul> tags are siblings. Any idea why it's not pulling the tag with the word "Millennials" in it?

Upvotes: 0

Views: 1156

Answers (1)

A-y

Reputation: 793

previous_sibling returns the node immediately before the tag, which can be either a tag or a string. In your case, it returns the string '\n', the whitespace between the closing </p> and the <ul>.
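To see the effect in isolation, here is a minimal sketch. The two markup strings are made up for illustration (they are not the Pew page's actual HTML); the only difference between them is the newline between the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the newline between </p> and <ul> becomes its own text node.
with_newline = BeautifulSoup(
    "<p>Millennials</p>\n<ul><li>item</li></ul>", "html.parser"
)
without_newline = BeautifulSoup(
    "<p>Millennials</p><ul><li>item</li></ul>", "html.parser"
)

sibling_a = with_newline.ul.previous_sibling     # the whitespace string
sibling_b = without_newline.ul.previous_sibling  # the <p> tag itself

print(repr(sibling_a))
print(repr(sibling_b))
```

So previous_sibling only gives you the <p> when nothing at all, not even whitespace, sits between the two tags.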

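Instead, you could use the find_previous method (findPrevious is its older alias) to get the tag preceding what you selected:

from bs4 import BeautifulSoup

doc = """
<h2>test</h2>
<ul>
    <li>1</li>
    <li>2</li>
</ul>
"""

soup = BeautifulSoup(doc, 'html.parser')
tags = soup.find_all('ul')

# find_previous() skips strings and returns the nearest tag before each <ul>
print([ul.find_previous() for ul in tags])
print(tags)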

This will output (whitespace inside the <ul> omitted):

[<h2>test</h2>]
[<ul><li>1</li><li>2</li></ul>]
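One caveat worth checking on your page: find_previous walks backward through the document, not just through siblings, so if the preceding <p> has tags nested inside it you get the innermost one. Since you say the <p> and <ul> are siblings, find_previous_sibling may be closer to what you want. The sketch below assumes the word is wrapped in a <strong> inside the <p>; that markup and the helper name with_context are my own guesses, not taken from the page.

```python
from bs4 import BeautifulSoup

# Assumed markup: the label is wrapped in <strong> inside the <p>.
html = "<div><p><strong>Millennials</strong></p>\n<ul><li>item</li></ul></div>"
soup = BeautifulSoup(html, "html.parser")
first_ul = soup.find("ul")

# True matches any tag; the search runs backward in document order.
nearest_tag = first_ul.find_previous(True)        # the nested <strong>
sibling_p = first_ul.find_previous_sibling("p")   # the whole preceding <p>


def with_context(ul_tag):
    """Prepend the preceding sibling <p>, if any, to the <ul> markup."""
    label = ul_tag.find_previous_sibling("p")
    return (str(label) if label is not None else "") + str(ul_tag)


ul_with_context = [with_context(ul_tag) for ul_tag in soup.find_all("ul")]
print(ul_with_context)
```

The None check matters because find_previous_sibling returns None when a <ul> has no preceding <p>, and str(None) would otherwise put the text "None" in front of the list.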

Upvotes: 1
