Reputation: 51
I want to extract data from this page: https://www.adobe.com/support/security/advisories/apsa11-04.html. I just want to extract these four fields:
Release date: December 6, 2011 Last updated: January 10, 2012 Vulnerability identifier: APSA11-04 CVE number: CVE-2011-2462
My code:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.adobe.com/support/security/advisories/apsa11-04.html")
soup = BeautifulSoup(r.content, "html.parser")

div = soup.find("div", attrs={"id": "L0C1-body"})
for p in div.findAll("p"):
    if p.find("strong"):
        print(p.text)
The output:
Release date: December 6, 2011
Last updated: January 10, 2012
Vulnerability identifier: APSA11-04
CVE number: CVE-2011-2462
Platform: All
*Note: Adobe Reader for Android and Adobe Flash Player are not affected by this issue.
I do not want this part of the output:

Platform: All
*Note: Adobe Reader for Android and Adobe Flash Player are not affected by this issue.

How should I filter it out?
Upvotes: 2
Views: 348
Reputation: 84455
Rather than retrieving the entire collection and filtering afterwards, I would filter more efficiently within the selector itself, restricting it to the first 4 sibling p tags with :nth-of-type:
import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint
r = requests.get('https://www.adobe.com/support/security/advisories/apsa11-04.html')
soup = bs(r.content, 'html.parser')
pprint([i.text for i in soup.select('h2 ~ p:nth-of-type(-n+4)')])
You could also use the limit argument:
pprint([i.text for i in soup.select('h2 ~ p', limit=4)])
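Since the live page may change or go offline, here is the same selector logic checked against a small inline snippet. The HTML below is a simplified, hypothetical stand-in for the advisory markup, not Adobe's actual page:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the advisory markup (hypothetical, not Adobe's real HTML)
html = """
<h2>Summary</h2>
<p><strong>Release date:</strong> December 6, 2011</p>
<p><strong>Last updated:</strong> January 10, 2012</p>
<p><strong>Vulnerability identifier:</strong> APSA11-04</p>
<p><strong>CVE number:</strong> CVE-2011-2462</p>
<p><strong>Platform:</strong> All</p>
"""
soup = BeautifulSoup(html, "html.parser")

# :nth-of-type(-n+4) keeps only the first four <p> siblings after the <h2>
first_four = [p.text for p in soup.select("h2 ~ p:nth-of-type(-n+4)")]
print(first_four)

# limit=4 on select() gives the same result here
same_four = [p.text for p in soup.select("h2 ~ p", limit=4)]
assert first_four == same_four
```

Both variants drop the fifth paragraph ("Platform: All"), which is exactly the line the question wants filtered out.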
Upvotes: 1
Reputation: 195408
If you know you always want the first 4 <p> tags after the <h2> tag, you can use this example:
import requests
from bs4 import BeautifulSoup
url = "https://www.adobe.com/support/security/advisories/apsa11-04.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
txt = "\n".join(
    map(lambda x: x.get_text(strip=True, separator=" "), soup.select("h2 ~ p")[:4])
)
print(txt)
Prints:
Release date: December 6, 2011
Last updated: January 10, 2012
Vulnerability identifier: APSA11-04
CVE number: CVE-2011-2462
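The `get_text(strip=True, separator=" ")` call is what keeps each line clean: it strips each text fragment inside the tag and joins the fragments with a single space, so nested tags like `<strong>` and stray whitespace do not garble the output. A toy illustration (the markup is made up, not from the advisory page):

```python
from bs4 import BeautifulSoup

# Text split across a nested <strong> plus ragged whitespace
p = BeautifulSoup(
    "<p><strong>CVE number:</strong>\n CVE-2011-2462 </p>", "html.parser"
).p

# strip=True trims each fragment; separator=" " rejoins them with one space
print(p.get_text(strip=True, separator=" "))  # → CVE number: CVE-2011-2462
```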
Upvotes: 1