Reputation: 425
I'm pulling lists from webpages and, to give them context, I'm also pulling the text immediately preceding them. Pulling the tag preceding the <ul> or <ol> tag seems to be the best way. So let's say I have this list:
I'd want to pull the bullet and word "Millennials". I use a BeautifulSoup function:
#pull <ul> tags
def pull_ul(tag):
    return tag.name == 'ul' and tag.li and not tag.attrs and not tag.li.attrs and not tag.a
ul_tags = webpage.find_all(pull_ul)
#find text immediately preceding any <ul> tag and append to <ul> tag
ul_with_context = [str(ul.previous_sibling) + str(ul) for ul in ul_tags]
When I print ul_with_context, I get the following:
['\n<ul>\n<li>With immigration adding more numbers to its group than any other, the Millennial population is projected to peak in 2036 at 81.1 million. Thereafter the oldest Millennial will be at least 56 years of age and mortality is projected to outweigh net immigration. By 2050 there will be a projected 79.2 million Millennials.</li>\n</ul>']
As you can see, "Millennials" wasn't pulled. The page I'm pulling from is http://www.pewresearch.org/fact-tank/2016/04/25/millennials-overtake-baby-boomers/ Here's the section of code for the bullet:
The <p> and <ul> tags are siblings. Any idea why it's not pulling the tag with the word "Millennials" in it?
Upvotes: 0
Views: 1156
Reputation: 793
previous_sibling returns the node immediately preceding the tag, which can be either an element or a string. In your case, it returns the string '\n': the whitespace between the preceding tag and the <ul>.
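A minimal sketch of this behaviour, assuming a made-up fragment in which a <p> and a <ul> are separated only by a newline (the markup below is illustrative, not copied from the Pew page):

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical markup: a paragraph, a newline, then the list.
html = "<div><p>Millennials</p>\n<ul>\n<li>item</li>\n</ul></div>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul")

# The immediate previous sibling is the whitespace text node, not the <p>.
prev = ul.previous_sibling
print(type(prev).__name__, repr(str(prev)))

# Walking further back through previous_siblings reaches the <p> element.
siblings = list(ul.previous_siblings)
is_tag = [isinstance(node, Tag) for node in siblings]
print(is_tag)
```

The first sibling is a text node and only the second is a tag, which is why str(ul.previous_sibling) contributes nothing but a newline.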
Instead, you could use the find_previous method (findPrevious is its older alias) to get the tag preceding the one you selected:
from bs4 import BeautifulSoup

doc = """
<h2>test</h2>
<ul>
<li>1</li>
<li>2</li>
</ul>
"""
soup = BeautifulSoup(doc, 'html.parser')
tags = soup.find_all('ul')
# find_previous() is the current name for findPrevious()
print([ul.find_previous() for ul in tags])
print(tags)
This outputs:
[<h2>test</h2>]
[<ul><li>1</li><li>2</li></ul>]
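One caveat: find_previous walks backwards through the document rather than through siblings, so if the preceding paragraph contains nested tags it may hand back an inner tag instead of the whole <p>. Since the question says the <p> and <ul> are siblings, find_previous_sibling is an alternative worth trying. The sketch below is only an illustration: the markup, the <strong> wrapper, and the helper name with_context are assumptions, not taken from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup loosely modelled on the question's description.
html = """
<div>
<p><strong>Millennials</strong></p>
<ul>
<li>Projected to peak in 2036.</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def with_context(ul):
    """Return (context text, list tag) for one <ul>.

    find_previous_sibling() skips text nodes such as '\n' and returns
    the nearest preceding sibling tag, or None if there is none.
    """
    prev = ul.find_previous_sibling()
    context = prev.get_text(strip=True) if prev is not None else ""
    return context, ul


pairs = [with_context(ul) for ul in soup.find_all("ul")]
for context, ul in pairs:
    print(context)
    print(ul)
```

To get a single string per list, as in the question's ul_with_context, the two pieces can be joined with str(prev) + str(ul) instead of returning a tuple.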
Upvotes: 1