Reputation: 23
I want to extract all the PDF links that take us directly to the page from which we can download the PDFs, and store these links in a DataFrame.
import requests
from bs4 import BeautifulSoup

url = "https://www.volvogroup.com/en/news-and-media/press-releases.html"
source = requests.get(url)
soup = BeautifulSoup(source.text, "html.parser")

news_check = soup.find_all("a", class_="articlelist__contentDownloadItem")
for i in news_check:
    print(i)
    break

data = set()
for i in soup.find_all('a'):
    for j in i.find_all('href'):
        pdf_link = "https://www.volvogroup.com" + j.get('.pdf')
        data.add(j)
        print(pdf_link)
Upvotes: 0
Views: 324
Reputation: 1560
You can try the code below to get the PDF links. `find_all('href')` searches for `<href>` tags, which don't exist; the `href` is an *attribute* of each `<a>` tag, so read it with `i['href']`:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

url = "https://www.volvogroup.com/en/news-and-media/press-releases.html"
source = requests.get(url)
soup = bs(source.text, "html.parser")

# Each matching <a> tag carries the PDF path in its href attribute
news_check = soup.find_all("a", class_="articlelist__contentDownloadItem")

data = set()
for i in news_check:
    # The hrefs are site-relative, so prepend the domain
    pdf_link = "https://www.volvogroup.com" + i['href']
    data.add(pdf_link)

df = pd.DataFrame(data)
print(df)
Output:
0 https://www.volvogroup.com/content/dam/volvo-g...
1 https://www.volvogroup.com/content/dam/volvo-g...
2 https://www.volvogroup.com/content/dam/volvo-g...
3 https://www.volvogroup.com/content/dam/volvo-g...
4 https://www.volvogroup.com/content/dam/volvo-g...
5 https://www.volvogroup.com/content/dam/volvo-g...
6 https://www.volvogroup.com/content/dam/volvo-g...
7 https://www.volvogroup.com/content/dam/volvo-g...
8 https://www.volvogroup.com/content/dam/volvo-g...
9 https://www.volvogroup.com/content/dam/volvo-g...
10 https://www.volvogroup.com/content/dam/volvo-g...
11 https://www.volvogroup.com/content/dam/volvo-g...
12 https://www.volvogroup.com/content/dam/volvo-g...
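If the page ever mixes PDF and non-PDF download links, you can filter on the `.pdf` extension and give the DataFrame a named column. A minimal, self-contained sketch (no network access; the sample hrefs below are made up for illustration):

```python
import pandas as pd

# Hypothetical hrefs as they might appear in the page's anchor tags
hrefs = [
    "/content/dam/volvo-group/press-release-1.pdf",
    "/content/dam/volvo-group/press-release-2.pdf",
    "/en/news-and-media/press-releases.html",  # not a PDF, skipped by the filter
]

base = "https://www.volvogroup.com"
# Keep only .pdf links, deduplicate with a set, sort for a stable order
pdf_links = sorted({base + h for h in hrefs if h.lower().endswith(".pdf")})

# A named column makes the DataFrame easier to work with later
df = pd.DataFrame(pdf_links, columns=["pdf_link"])
print(df)
```

The same filter can be applied to the `data` set from the answer above before building the DataFrame.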
Upvotes: 1