tas

Reputation: 51

How to scrape specific <p> tags inside a <div> from HTML using Python

I want to extract data from this page: https://www.adobe.com/support/security/advisories/apsa11-04.html. I only want to extract the following:

Release date: December 6, 2011
Last updated: January 10, 2012
Vulnerability identifier: APSA11-04
CVE number: CVE-2011-2462

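My code:

import requests
from bs4 import BeautifulSoup

url = "https://www.adobe.com/support/security/advisories/apsa11-04.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# Print every <p> in the body div that contains a <strong> label
div = soup.find("div", attrs={"id": "L0C1-body"})
for p in div.find_all("p"):
    if p.find("strong"):
        print(p.text)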

The output:

Release date: December 6, 2011
Last updated: January  10, 2012
Vulnerability identifier: APSA11-04
CVE number: CVE-2011-2462
Platform: All
*Note: Adobe Reader for Android and Adobe Flash Player are not affected by this issue.

I do not want the last two lines of this output. How can I filter them out?

Platform: All
*Note: Adobe Reader for Android and Adobe Flash Player are not affected by this issue.

Upvotes: 2

Views: 348

Answers (2)

QHarr

Reputation: 84455

Rather than retrieving the entire collection and filtering it afterwards, I would restrict the match to the first 4 sibling <p> tags in the selector itself, using :nth-of-type:

import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint
    
r = requests.get('https://www.adobe.com/support/security/advisories/apsa11-04.html')
soup = bs(r.content, 'html.parser')
pprint([i.text for i in soup.select('h2 ~ p:nth-of-type(-n+4)')])

You could also use the limit argument:

pprint([i.text for i in soup.select('h2 ~ p', limit=4)])
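To try both selectors without hitting the network, here is a minimal sketch run against an inline snippet. The HTML below is a hypothetical, simplified stand-in for the advisory page, not its real markup, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the advisory page's markup
html = """
<div id="L0C1-body">
  <h2>Security Advisory</h2>
  <p><strong>Release date:</strong> December 6, 2011</p>
  <p><strong>Last updated:</strong> January 10, 2012</p>
  <p><strong>Vulnerability identifier:</strong> APSA11-04</p>
  <p><strong>CVE number:</strong> CVE-2011-2462</p>
  <p><strong>Platform:</strong> All</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter inside the selector: <p> siblings after <h2> that are
# among the first four <p> children of their parent
by_position = [p.get_text() for p in soup.select("h2 ~ p:nth-of-type(-n+4)")]

# Filter by count: stop after the first four matches
by_limit = [p.get_text() for p in soup.select("h2 ~ p", limit=4)]

print(by_position == by_limit)
```

One difference worth keeping in mind: :nth-of-type counts every <p> under the same parent, including any that come before the <h2>, while limit counts only the elements that matched. On this snippet the two agree, but they would diverge if a <p> preceded the heading.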

Upvotes: 1

Andrej Kesely

Reputation: 195408

If you know you always want the first 4 <p> tags after the <h2> tag, you can use this example:

import requests
from bs4 import BeautifulSoup


url = "https://www.adobe.com/support/security/advisories/apsa11-04.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

txt = "\n".join(
    map(lambda x: x.get_text(strip=True, separator=" "), soup.select("h2 ~ p")[:4])
)
print(txt)

Prints:

Release date: December 6, 2011
Last updated: January  10, 2012
Vulnerability identifier: APSA11-04
CVE number: CVE-2011-2462
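If the fields are needed individually rather than as printed text, the lines above could be split into label/value pairs. This is only a sketch, assuming every line has the form "label: value"; parse_fields is a hypothetical helper, not part of BeautifulSoup.

```python
def parse_fields(text):
    """Split 'label: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            # Collapse repeated whitespace, e.g. "January  10" -> "January 10"
            fields[label.strip()] = " ".join(value.split())
    return fields


# Same text as the printed output above
txt = """Release date: December 6, 2011
Last updated: January  10, 2012
Vulnerability identifier: APSA11-04
CVE number: CVE-2011-2462"""

fields = parse_fields(txt)
print(fields["CVE number"])
```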

Upvotes: 1
