Reputation: 43
I want to extract the items of the "Here’s what’s new" section from this page, starting with "In the coming weeks" and ending with "general enhancements".
Inspecting the markup, I see that the <span> is nested under an <li>, which is in turn nested under <ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">. I have been trying to extract it with Python 3 and BeautifulSoup for the last few days, but to no avail. The code I tried is pasted below.
Would somebody be so kind as to point me in the right direction?
1#
from urllib.request import urlopen  # open URLs
from bs4 import BeautifulSoup  # BS
import sys  # sys.exit()
page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'
try:
    page = urlopen(page_url)
except:
    sys.exit("No internet connection. Program exiting...")
soup = BeautifulSoup(page, 'html.parser')
try:
    for ultag in soup.find_all('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
        print(ultag.text)
        for spantag in ultag.find_all('span'):
            print(spantag)
except:
    print("Couldn't get What's new :(")
2#
from urllib.request import urlopen  # open URLs
from bs4 import BeautifulSoup  # BS
import sys  # sys.exit()
page_url = 'https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS'
try:
    page = urlopen(page_url)
except:
    sys.exit("No internet connection. Program exiting...")
soup = BeautifulSoup(page, 'html.parser')
uls = []
for ul in uls:
    for ul in soup.findAll('ul', {'id': 'GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B'}):
        if soup.find('ul'):
            break
        uls.append(ul)
        print(uls)
        for li in uls:
            print(li.text)
Ideally, the code should return:
In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.
Performance improvements, bug fixes, and other general enhancements.
But both attempts print nothing. It looks like BeautifulSoup can't find the ul with that ID, yet if you print(soup) everything looks fine:
<ul id="GUID-8B03C49D-3A98-45F1-9128-392E55823F61__UL_E0490B159DE04E22AD519CE2E7D7A35B">
<li>
<span class="a-list-item"><span><strong>Read Now</strong></span>: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.</span></li>
<li>
<span class="a-list-item">Performance improvements, bug fixes, and other general enhancements.<br></li>
</ul>
Upvotes: 2
Views: 691
Reputation: 84465
With bs4 4.7.1+ you can use the :contains and :has pseudo-classes to isolate the heading and the list items that follow it:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS')
soup = bs(r.content, 'lxml')
text = [i.text.strip() for i in soup.select('p:has(strong:contains("Here’s what’s new:")), p:has(strong:contains("Here’s what’s new:")) + p + ul li')]
print(text)
With the page as it is currently structured, you can also drop the :contains filter:
text = [i.text.strip() for i in soup.select('p:has(strong), p:has(strong) + p + ul li')]
print(text)
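As far as I know, Soup Sieve 2.1 and later deprecate :contains in favour of :-soup-contains, so the first selector may emit a warning on newer installs. Below is a minimal offline sketch of the same idea using the newer name. The HTML snippet is a hypothetical stand-in for the page's structure, not the real markup, and it uses html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the relevant part of the page (not the real markup).
html = """
<div>
  <p><strong>Here’s what’s new:</strong></p>
  <p>Version notes</p>
  <ul>
    <li><span class="a-list-item">Read Now: one-click reading.</span></li>
    <li><span class="a-list-item">Performance improvements.</span></li>
  </ul>
  <p>Other text</p>
  <ul><li>Unrelated item</li></ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match the <p> whose <strong> child contains the heading text ...
heading = 'p:has(strong:-soup-contains("Here’s what’s new"))'
# ... then step over the next <p> to the <ul> right after it and take its <li> items.
items = [li.get_text(strip=True) for li in soup.select(f"{heading} + p + ul li")]
print(items)
```

Only the list that follows the heading is picked up; the second, unrelated list is ignored because it is not reached by the heading + p + ul chain.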
The + is the CSS adjacent sibling combinator. To quote the reference:
Adjacent sibling combinator
The + combinator selects adjacent siblings. This means that the second element directly follows the first, and both share the same parent.
Syntax: A + B
Example: h2 + p will match all <p> elements that directly follow an <h2>.
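To see the combinator in action, here is a small sketch on a made-up HTML snippet (the markup and variable names are illustrative assumptions). It contrasts + with the general sibling combinator ~, which matches any later sibling rather than only the immediately following one.

```python
from bs4 import BeautifulSoup

# Made-up markup, purely for illustration.
html = """
<div>
  <h2>First heading</h2>
  <p>Directly after the first h2</p>
  <p>Second paragraph, preceded by a p</p>
  <h2>Second heading</h2>
  <ul><li>list in between</li></ul>
  <p>After a ul, not an h2</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "h2 + p": only a <p> that is the very next element sibling of an <h2>.
adjacent = [p.get_text() for p in soup.select("h2 + p")]

# "h2 ~ p": every <p> that comes anywhere after an <h2> under the same parent.
general = [p.get_text() for p in soup.select("h2 ~ p")]

print(adjacent)
print(general)
```

Whitespace between tags does not count as a sibling here; the combinators only consider element nodes.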
Upvotes: 3
Reputation: 2445
First, the page appears to be rendered dynamically, so you may need Selenium to get the page content reliably.
Second, you can find the p tag that contains the text "Here’s what’s new" and then take the next ul sibling.
Here is the code:
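from bs4 import BeautifulSoup as soup
from selenium import webdriver
url = "https://www.amazon.com/gp/help/customer/display.html/ref=hp_left_v4_sib?ie=UTF8&nodeId=G54HPVAW86CHYHKS"
driver = webdriver.Firefox()
driver.get(url)  # get() returns None; the rendered HTML is read from page_source
html = soup(driver.page_source, 'html.parser')
driver.quit()  # close the browser once the HTML has been captured
for p in html.find_all('p'):
    # locate the paragraph that holds the section heading
    if p.text and "Here’s what’s new" in p.text:
        ul = p.find_next_sibling('ul')  # first <ul> sibling after that paragraph
        for li in ul.find_all('li'):
            print(li.text)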
OUTPUT:
Read Now: In the coming weeks, you will be able to read items that you own with a single click from the ‘Before You Go’ dialog.
Performance improvements, bug fixes, and other general enhancements.
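The find_next_sibling step can be tried without a browser. The sketch below wraps it in a helper and runs it on an inline snippet; the helper name items_after_heading and the sample markup are hypothetical, and the None check is an added safeguard for the case where no ul follows the heading.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<div>
  <p><strong>Here’s what’s new:</strong></p>
  <p>Intro text</p>
  <ul>
    <li><span><strong>Read Now</strong></span>: one-click reading.</li>
    <li>General enhancements.</li>
  </ul>
</div>
"""


def items_after_heading(markup, heading_text):
    """Return the <li> texts of the first <ul> sibling after the matching <p>."""
    soup = BeautifulSoup(markup, "html.parser")
    for p in soup.find_all("p"):
        if heading_text in p.get_text():
            ul = p.find_next_sibling("ul")  # skips non-<ul> siblings in between
            if ul is None:
                continue  # heading found, but no list follows it
            return [li.get_text(strip=True) for li in ul.find_all("li")]
    return []  # heading not present at all


print(items_after_heading(SAMPLE, "Here’s what’s new"))
print(items_after_heading(SAMPLE, "No such heading"))
```

The same helper could be fed driver.page_source or the body of a plain HTTP response, whichever turns out to contain the list.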
Upvotes: 0