tas

Reputation: 51

How to scrape <p> tags inside <div> tags that have a class/id from HTML using Python

I want to extract data such as:

Release date: June 16, 2016
Vulnerability identifier: APSB16-23
Priority: 3
CVE number: CVE-2016-4126

from https://helpx.adobe.com/security/products/air/apsb16-23.ug.html

The code:

import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint
    
r = requests.get('https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html')
soup = bs(r.content, 'html.parser')
pprint([i.text for i in soup.select('div > .text > p', limit=4)])

The output:

['Release date:\xa0September 13, 2016',
 'Vulnerability identifier: APSB16-31',
 'Priority: 3',
 'CVE number:\xa0CVE-2016-6936']

The problem is the \xa0 in the output. How can I remove it? Is there more efficient code than this? I would also like to write the result to a CSV file. Thank you.

Upvotes: 0

Views: 100

Answers (2)

baduker

Reputation: 20022

\xa0 is the non-breaking space (U+00A0, which is byte 0xA0 in Latin-1/ISO 8859-1), i.e. chr(160). You should replace it with a regular space.

Try this:

import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint

r = requests.get('https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html')
soup = bs(r.content, 'html.parser')
pprint([i.text.replace('\xa0', ' ') for i in soup.select('div > .text > p', limit=4)])

Output:

['Release date: September 13, 2016',
 'Vulnerability identifier: APSB16-31',
 'Priority: 3',
 'CVE number: CVE-2016-6936']
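
If other kinds of Unicode whitespace might turn up as well, a slightly more general cleanup may be worth considering. This is only a sketch: clean_text is a hypothetical helper name (not part of BeautifulSoup), and the sample strings are copied from the output in the question. Note that NFKC normalization also rewrites other compatibility characters (ligatures, full-width forms), which may or may not be what you want.

```python
import unicodedata


def clean_text(text):
    """Hypothetical helper: normalize U+00A0 and collapse runs of whitespace."""
    # NFKC maps the non-breaking space (U+00A0) to a regular space (U+0020).
    normalized = unicodedata.normalize("NFKC", text)
    # split() with no arguments splits on any run of whitespace.
    return " ".join(normalized.split())


# Sample strings taken from the output shown in the question.
raw = ['Release date:\xa0September 13, 2016', 'CVE number:\xa0CVE-2016-6936']
print([clean_text(s) for s in raw])
```

For this particular page, the plain .replace('\xa0', ' ') above is enough.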

EDIT: To write the result to a .csv file, you can use pandas.

Here's how:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html')
soup = bs(r.content, 'html.parser')
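# Split each "label: value" line once, so a value containing ": " stays intact.
release = [
    i.get_text().replace('\xa0', ' ').split(": ", 1)
    for i in soup.select('div > .text > p', limit=4)
]
# Use the labels as the index, then transpose so they become the CSV header.
pd.DataFrame(release).set_index(0).T.to_csv("release_data.csv", index=False)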

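
If you would rather not pull in pandas for a single row, the standard-library csv module can probably do the same job. A rough sketch, assuming the four cleaned strings from the earlier output as input; to_row and write_csv are hypothetical helper names introduced here for illustration:

```python
import csv

# Cleaned lines, copied from the output of the earlier snippet.
lines = [
    'Release date: September 13, 2016',
    'Vulnerability identifier: APSB16-31',
    'Priority: 3',
    'CVE number: CVE-2016-6936',
]


def to_row(lines):
    """Turn 'label: value' strings into a {label: value} dict."""
    # maxsplit=1 keeps a value intact even if it contains ': ' itself.
    return dict(line.split(": ", 1) for line in lines)


def write_csv(path, row):
    """Write one header line and one data line to path."""
    # newline='' is what the csv module documentation recommends for files.
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)


row = to_row(lines)
write_csv("release_data.csv", row)
```

The date contains a comma, so the csv writer quotes that field automatically.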

Upvotes: 2

Karan Mittal

Reputation: 181

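I just used your code and added a for loop over the extracted HTML tags. The \xa0 shows up in your output because pprint displays the repr of each string in the list, which escapes the non-breaking space. Printing each string directly with print() writes the character itself, so it looks like an ordinary space (it is still in the string, though).

As for the script, I just adapted yours.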

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = "https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
data = soup.select('div > .text > p', limit=4)

for i in data:
    print(i.text)
    print("-"*20)

This will give you the desired output.
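
To see why the loop output looks clean, it may help to compare how a list and a plain string are displayed. A small sketch using a sample string copied from the question's output (no network access needed):

```python
text = 'Release date:\xa0September 13, 2016'

# A list is displayed using repr() of its items, so the escape is visible.
print([text])

# print() on the string itself writes the actual character.
print(text)

# The non-breaking space is still in the string either way.
print('\xa0' in text)

# Replacing it is what actually removes it from the data.
cleaned = text.replace('\xa0', ' ')
print('\xa0' in cleaned)
```

So if the text is going to be stored or compared later, replacing the character is still the safer route.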

Upvotes: 0
