Reputation: 15
Hi everyone, I need some help with my web scraper. I'm trying to download as many biomedical PDFs as I can (hundreds of them) from https://jbiomedsci.biomedcentral.com/. I have built the scraper using some answers from this website, but I can't seem to get it to work properly.
My aim is to download the PDFs and store them in a specific folder, and I would be grateful for any help with this.
url="https://jbiomedsci.biomedcentral.com/articles"
response = request.urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")
links = soup.find_all('a', href=re.compile(r'(.pdf)'))
url_list = []
for el in links:
if(el['href'].startswith('http')):
url_list.append(el['href'])
else:
url_list.append("https://jbiomedsci.biomedcentral.com" + el['href'])
print(url_list)
for url in url_list:
print(url)
pathname ="C:/Users/SciencePDF/"
fullfilename = os.path.join(pathname, url.replace("https://jbiomedsci.biomedcentral.com/articles",
""))
print(fullfilename)
request.urlretrieve(url, fullfilename)
Upvotes: 0
Views: 957
Reputation: 22440
I've modified your script to make it work. When you run the following script, it will create a folder in the same directory as the script itself and store the downloaded PDF files in that newly created folder.
import os
import requests
from bs4 import BeautifulSoup

base = 'https://jbiomedsci.biomedcentral.com{}'
url = 'https://jbiomedsci.biomedcentral.com/articles'

res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# Name the download folder after the last segment of the listing URL ("articles")
foldername = url.split("/")[-1]
os.makedirs(foldername, exist_ok=True)

for pdf in soup.select("a[data-track-action='Download PDF']"):
    filename = pdf['href'].split("/")[-1]
    pdf_link = base.format(pdf['href']) + ".pdf"
    with open(f"{foldername}/{filename}.pdf", 'wb') as f:
        f.write(requests.get(pdf_link).content)
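Since you want hundreds of PDFs and the /articles page only shows the first batch of results, you will eventually need to walk through the listing pages as well. Here is a minimal sketch of how that might look, assuming the site accepts a page query parameter (e.g. /articles?page=2); the parameter name and the page range are assumptions you should verify against the live site.

import os
import time
import requests
from bs4 import BeautifulSoup

base = 'https://jbiomedsci.biomedcentral.com{}'
url = 'https://jbiomedsci.biomedcentral.com/articles'
foldername = url.split("/")[-1]
os.makedirs(foldername, exist_ok=True)

# Assumed page range; check how many listing pages the site actually exposes
for page in range(1, 6):
    res = requests.get(url, params={'page': page})
    soup = BeautifulSoup(res.text, "html.parser")
    for pdf in soup.select("a[data-track-action='Download PDF']"):
        filename = pdf['href'].split("/")[-1]
        pdf_link = base.format(pdf['href']) + ".pdf"
        with open(f"{foldername}/{filename}.pdf", 'wb') as f:
            f.write(requests.get(pdf_link).content)
    # Pause between pages to avoid hammering the server
    time.sleep(1)

Keeping a short delay between requests is also polite to the server when you scale this up to hundreds of downloads.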
Upvotes: 1