cashread

Reputation: 43

How to extract text from <span> nested in <li> which is nested in <ul> using BeautifulSoup?

I want to extract the items of the "Here’s what’s new" section from this page, starting with "In the coming weeks" and ending with "general enhancements".

Inspecting the page source, I see the <span> is nested under an <li>, which is in turn nested under <ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">. I have been trying to extract it with Python 3 and BeautifulSoup for the last few days, but to no avail. The code I tried is pasted below.

Would somebody be so kind as to point me in the right direction?

Attempt 1:

from urllib.request import urlopen # open URLs 
from bs4 import BeautifulSoup # BS

import sys # sys.exit() 

page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'

try: 
    page = urlopen(page_url)
except: 
    sys.exit("No internet connection. Program exiting...")

soup = BeautifulSoup(page, 'html.parser')

try: 
    for ultag in soup.find_all('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
        print(ultag.text)
        for spantag in ultag.find_all('span'):
            print(spantag)
except:
    print("Couldn't get What's new :(")

Attempt 2:

from urllib.request import urlopen # open URLs 
from bs4 import BeautifulSoup # BS

import sys # sys.exit() 

page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'

try: 
    page = urlopen(page_url)
except: 
    sys.exit("No internet connection. Program exiting...")

soup = BeautifulSoup(page, 'html.parser')

uls = []
for ul in uls:
    for ul in soup.findAll('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
        if soup.find('ul'):
            break
        uls.append(ul)
    print(uls)
    for li in uls:
        print(li.text)

Ideally the code should return:

In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.

Performance improvements, bug fixes, and other general enhancements.

But both attempts print nothing. It looks like BeautifulSoup can't find the ul with that ID, yet if you print(soup) the markup is there as expected:

<ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">
<li>
<span class="a-list-item"><span><strong>Read Now</strong></span>: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.</span></li>

<li>
<span class="a-list-item">Performance improvements, bug fixes, and other general enhancements.<br></li>


</ul>

Upvotes: 2

Views: 691

Answers (2)

QHarr

Reputation: 84465

With bs4 4.7.1+ you can use the :contains and :has pseudo-classes to isolate the section:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS')
soup = bs(r.content, 'lxml')
text = [i.text.strip() for i in soup.select('p:has(strong:contains("Here’s what’s new:")), p:has(strong:contains("Here’s what’s new:")) + p + ul li')]
print(text)
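
A note for anyone on a newer install: from Soup Sieve 2.1 onward, :contains is deprecated in favor of :-soup-contains. Below is a minimal offline sketch of the same selector idea using the newer spelling. The HTML fragment is made up for illustration (it only mimics the heading paragraph, intro paragraph, list layout that the selector above assumes) and is not copied from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the assumed layout:
# heading <p> with <strong>, one more <p>, then the <ul> we want.
html = """
<div>
  <p><strong>Here’s what’s new:</strong></p>
  <p>Version notes</p>
  <ul>
    <li><span class="a-list-item">First item</span></li>
    <li><span class="a-list-item">Second item</span></li>
  </ul>
  <p>Unrelated paragraph</p>
  <ul><li>Unrelated item</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# :has() keeps only <p> elements containing a matching <strong>;
# :-soup-contains() is the non-deprecated spelling of :contains().
heading = 'p:has(strong:-soup-contains("Here’s what’s new"))'
items = [li.get_text(strip=True) for li in soup.select(f"{heading} + p + ul li")]
print(items)
```

Only the list that sits two siblings after the heading paragraph is picked up; the second list is skipped because the element two positions before it is not the heading paragraph.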


As the page currently stands, you can also drop the :contains filter:

text = [i.text.strip() for i in soup.select('p:has(strong), p:has(strong) + p + ul li')]
print(text)

The + is the CSS adjacent sibling combinator. To quote the reference:

Adjacent sibling combinator

The + combinator selects adjacent siblings. This means that the second element directly follows the first, and both share the same parent.

Syntax: A + B

Example: h2 + p will match all <p> elements that directly follow an <h2>.
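
To see the quoted rule in action with BeautifulSoup's select, here is a small sketch on an invented fragment. The ~ (general sibling) combinator is not part of the answer above and is shown only for contrast.

```python
from bs4 import BeautifulSoup

# Invented fragment: the first <h2> is directly followed by a <p>,
# the second <h2> is directly followed by a <ul>.
html = (
    "<div>"
    "<h2>Title</h2><p>first</p><p>second</p>"
    "<h2>Other</h2><ul><li>x</li></ul><p>third</p>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# A + B: B must be the very next element sibling of A.
adjacent = [p.get_text() for p in soup.select("h2 + p")]

# A ~ B: B may be any later element sibling of A (contrast only).
following = [p.get_text() for p in soup.select("h2 ~ p")]

print(adjacent)
print(following)
```

This is why the selector chains p + p + ul: each + steps exactly one element sibling forward, so the list must be the second element after the heading paragraph.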

Upvotes: 3

Maaz

Reputation: 2445

First, the page is rendered dynamically, so you have to use Selenium to get the page content correctly.

Second, you can find the p tag that contains the text "Here’s what’s new" and then get the next ul tag that follows it.

Here is the code:

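from bs4 import BeautifulSoup as soup
from selenium import webdriver

url = "https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS"

# Load the page in a real browser so JavaScript-rendered content is present
driver = webdriver.Firefox()
try:
    driver.get(url)  # get() returns None; the HTML is read from page_source
    html = soup(driver.page_source, 'html.parser')
finally:
    driver.quit()  # always close the browser, even if loading fails

# Find the paragraph introducing the section, then read the next <ul> sibling
for p in html.find_all('p'):
    if "Here’s what’s new" in p.text:
        ul = p.find_next_sibling('ul')
        if ul is None:
            continue  # no list after this paragraph; keep looking
        for li in ul.find_all('li'):
            print(li.text)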

OUTPUT:

Read Now: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.

Performance improvements, bug fixes, and other general enhancements.
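
If you want to try the traversal without launching a browser, the same two steps can be sketched against a static fragment. The HTML below is a simplified, made-up stand-in for the page, and items_after is a hypothetical helper name, not something provided by bs4.

```python
from bs4 import BeautifulSoup

# Simplified, made-up stand-in for the relevant part of the page.
html = """
<div>
  <p><strong>Here’s what’s new:</strong></p>
  <p>Intro text</p>
  <ul>
    <li><span class="a-list-item"><span><strong>Read Now</strong></span>: one-click reading.</span></li>
    <li><span class="a-list-item">Performance improvements.</span></li>
  </ul>
</div>
"""


def items_after(soup, marker):
    """Return the <li> texts of the next <ul> sibling after the <p> containing marker."""
    for p in soup.find_all("p"):
        if marker in p.get_text():
            # find_next_sibling skips over non-<ul> siblings such as the intro <p>
            ul = p.find_next_sibling("ul")
            if ul is not None:
                return [li.get_text(strip=True) for li in ul.find_all("li")]
    return []  # marker not found, or no list follows it


soup = BeautifulSoup(html, "html.parser")
found = items_after(soup, "Here’s what’s new")
missing = items_after(soup, "No such heading")
print(found)
print(missing)
```

Returning an empty list when nothing matches avoids the AttributeError you would get from calling find_all on None.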

Upvotes: 0
