Ghost123

Reputation: 15

Downloading multiple PDFs from a website using web scraping

Hi everyone, I need some help with my web scraper. I want to download hundreds of PDF files from https://jbiomedsci.biomedcentral.com/, as I'm trying to download as many biomedical PDFs as I can from the site. I have built the scraper using some answers from this website, but I can't seem to get it to work properly.

My aim is to download the PDFs and store them in a specific folder, and I would be grateful for any help with this.

url="https://jbiomedsci.biomedcentral.com/articles"
response = request.urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")     
links = soup.find_all('a', href=re.compile(r'(.pdf)'))



url_list = []
  for el in links:
if(el['href'].startswith('http')):
url_list.append(el['href'])
   else:
    url_list.append("https://jbiomedsci.biomedcentral.com" + el['href'])

    print(url_list)



for url in url_list:
print(url)
pathname ="C:/Users/SciencePDF/"
fullfilename = os.path.join(pathname, url.replace("https://jbiomedsci.biomedcentral.com/articles", 
 ""))
print(fullfilename)
request.urlretrieve(url, fullfilename)

Upvotes: 0

Views: 957

Answers (1)

SIM

Reputation: 22440

I've modified your script to make it work. When you run the following script, it will create a folder in the same directory as the script itself and store the downloaded PDF files in that newly created folder.

import os
import requests
from bs4 import BeautifulSoup

base = 'https://jbiomedsci.biomedcentral.com{}'
url = 'https://jbiomedsci.biomedcentral.com/articles'

res = requests.get(url)
soup = BeautifulSoup(res.text, "html.parser")

# Name the download folder after the last path segment of the URL ("articles")
foldername = url.split("/")[-1]
os.mkdir(foldername)

# Each article's PDF link carries the data-track-action='Download PDF' attribute
for pdf in soup.select("a[data-track-action='Download PDF']"):
    filename = pdf['href'].split("/")[-1]
    pdf_link = base.format(pdf['href']) + ".pdf"
    with open(f"{foldername}/{filename}.pdf", 'wb') as f:
        f.write(requests.get(pdf_link).content)
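
Since the goal is to grab hundreds of PDFs, note that the script above only covers the first listing page. Below is a minimal sketch of how you might extend it to walk several pages, assuming the listing accepts a "page" query parameter (that parameter name is my assumption about the site, so verify it in your browser before relying on it):

import os
import requests
from bs4 import BeautifulSoup

base = 'https://jbiomedsci.biomedcentral.com{}'
url = 'https://jbiomedsci.biomedcentral.com/articles'

foldername = url.split("/")[-1]
os.makedirs(foldername, exist_ok=True)  # unlike os.mkdir, safe to re-run

# Assumption: the listing paginates via a "page" query parameter
for page in range(1, 6):  # first five pages as an example
    res = requests.get(url, params={"page": page})
    soup = BeautifulSoup(res.text, "html.parser")
    pdfs = soup.select("a[data-track-action='Download PDF']")
    if not pdfs:  # no PDF links on this page, so stop paging
        break
    for pdf in pdfs:
        filename = pdf['href'].split("/")[-1]
        pdf_link = base.format(pdf['href']) + ".pdf"
        with open(f"{foldername}/{filename}.pdf", 'wb') as f:
            f.write(requests.get(pdf_link).content)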

Upvotes: 1
