Reputation: 51
I want to extract data such as:
Release date: June 16, 2016 Vulnerability identifier: APSB16-23 Priority: 3 CVE number: CVE-2016-4126
from https://helpx.adobe.com/security/products/air/apsb16-23.ug.html
The code:
import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint
r = requests.get('https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html')
soup = bs(r.content, 'html.parser')
pprint([i.text for i in soup.select('div > .text > p', limit=4)])
The output:
['Release date:\xa0September 13, 2016',
'Vulnerability identifier: APSB16-31',
'Priority: 3',
'CVE number:\xa0CVE-2016-6936']
The problem is that the output contains \xa0. How should I remove it? Is there more efficient code than this? I would also like to write the result to a CSV file. Thank you.
Upvotes: 0
Views: 100
Reputation: 20022
\xa0 is actually a non-breaking space in Latin-1 (ISO 8859-1), also chr(160). You should replace it with a space.
Try this:
import requests
from bs4 import BeautifulSoup as bs
from pprint import pprint
r = requests.get('https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html')
soup = bs(r.content, 'html.parser')
pprint([i.text.replace(u'\xa0', u' ') for i in soup.select('div > .text > p', limit=4)])
Output:
['Release date: September 13, 2016',
'Vulnerability identifier: APSB16-31',
'Priority: 3',
'CVE number: CVE-2016-6936']
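If the same cleanup is needed in more than one place, it can be pulled into a small helper. This is only a sketch: the names clean_text and clean_text_nfkc are made up for illustration, and the sample strings are copied from the output in the question so nothing is fetched over the network. The NFKC variant also rewrites other compatibility characters (ligatures, full-width forms), so it is a broader change than a plain replace.

```python
import unicodedata


def clean_text(text):
    """Replace non-breaking spaces (U+00A0) with plain spaces and trim the ends."""
    return text.replace("\xa0", " ").strip()


def clean_text_nfkc(text):
    """Alternative: NFKC normalization folds U+00A0 into a regular space.

    Note that NFKC also changes other compatibility characters.
    """
    return unicodedata.normalize("NFKC", text).strip()


# Sample strings taken from the question's output (hypothetical stand-in
# for the text of the selected <p> tags).
rows = [
    "Release date:\xa0September 13, 2016",
    "CVE number:\xa0CVE-2016-6936",
]

cleaned = [clean_text(r) for r in rows]
print(cleaned)
```

With BeautifulSoup this would be used as clean_text(i.text) inside the list comprehension.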
EDIT: To write the result to a .csv file, use pandas. Here's how:
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html')
soup = bs(r.content, 'html.parser')
release = [
    i.getText().replace(u'\xa0', u' ').split(": ")
    for i in soup.select('div > .text > p', limit=4)
]
pd.DataFrame(release).set_index(0).T.to_csv("release_data.csv", index=False)
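If pandas is not available, roughly the same file can be produced with the standard library's csv module. The sketch below assumes the four lines have already been cleaned of \xa0; the function name to_csv_text and the hard-coded sample lines are illustrative only, not part of the original code.

```python
import csv
import io

# Cleaned lines as they appear in the output above (sample data, no network call).
lines = [
    "Release date: September 13, 2016",
    "Vulnerability identifier: APSB16-31",
    "Priority: 3",
    "CVE number: CVE-2016-6936",
]


def to_csv_text(lines):
    """Turn 'key: value' lines into a header row plus one data row of CSV text."""
    # Split only on the first ": " so values containing a colon stay intact.
    pairs = [line.split(": ", 1) for line in lines]
    header = [key for key, _ in pairs]
    values = [value for _, value in pairs]

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerow(values)
    return buf.getvalue()


csv_text = to_csv_text(lines)
print(csv_text)
```

To write a file instead of a string, pass a file opened with open("release_data.csv", "w", newline="") to csv.writer. The date field contains a comma, so the writer wraps it in quotes.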
Upvotes: 2
Reputation: 181
I just used your code and added a for loop over the extracted HTML tags. It seems that the Unicode conversion does not happen when a list comprehension is used, but that's just an assumption.
As for the script, I just adapted yours.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url = "https://helpx.adobe.com/cy_en/security/products/air/apsb16-31.html"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
data = [i for i in soup.select('div > .text > p', limit=4)]
for i in data:
    print(i.text)
    print("-"*20)
This will give you the desired output.
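A note on why the loop looks different: it is most likely not the list comprehension itself. pprint on a list shows the repr() of each string, and repr() escapes the non-breaking space as \xa0, whereas print() writes the character as-is, so it looks like an ordinary space on screen. The character is still in the string either way. A minimal sketch with a sample string from the question:

```python
# Sample string copied from the question's output.
text = "Release date:\xa0September 13, 2016"

# repr() is what pprint uses for strings inside a list; it escapes U+00A0.
as_repr = repr(text)
print(as_repr)

# print() writes the raw character, which renders like a normal space.
print(text)

# The non-breaking space is still present in the string.
has_nbsp = "\xa0" in text
print(has_nbsp)
```

So if the text is going into a CSV file, it is still worth replacing \xa0 explicitly rather than relying on how it looks when printed.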
Upvotes: 0