Reputation: 205

Web Scraping FileNotFoundError When Saving to PDF

I copied some Python code to download data from a website. Here is the specific page: https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1

Here is the code which I copied:

import requests
from bs4 import BeautifulSoup

def _getUrls_(res):
    hrefs = []
    soup = BeautifulSoup(res.text, 'lxml')
    main_content = soup.find('div',{'id' : 'content-core'})
    table = main_content.find("table")
    for a in table.findAll('a', href=True):
        hrefs.append(a['href'])
    return(hrefs)

bidurl = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'
r = requests.get(bidurl)
hrefs = _getUrls_(r)

def _getPdfs_(hrefs, basedir):
    for i in range(len(hrefs)):
        print(hrefs[i])
        respdf = requests.get(hrefs[i])
        pdffile = basedir + "/pdf_dot/" + hrefs[i].split("/")[-1] + ".pdf"
        try:
            with open(pdffile, 'wb') as p:
                p.write(respdf.content)
                p.close()
        except FileNotFoundError:
            print("No PDF produced")

basedir= "/Users/ABC/Desktop"
_getPdfs_(hrefs, basedir)

The code runs without raising an error, but it does not download anything at all, even though no FileNotFoundError is shown.

I tried the following two URLs:

https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-088a-035-20360
https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017/aqc-r100-258-21125

However, for both of these URLs the script only prints "No PDF produced".

The same code worked and downloaded the files successfully for other people, but not for me.

Upvotes: 1

Answers (4)

SIM

Reputation: 22440

You don't need to specify the directory or create any folder manually. All you need to do is run the following script. When the execution is done, you should get a folder named pdf_dot on your desktop containing the PDF files you wish to grab.

import requests
from bs4 import BeautifulSoup
import os

URL = 'https://www.codot.gov/business/bidding/bid-tab-archives/bid-tabs-2017-1'

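# Build the path with os.path.join to avoid unescaped backslashes in the string
dirf = os.path.join(os.environ['USERPROFILE'], 'Desktop', 'pdf_dot')
os.makedirs(dirf, exist_ok=True)  # create the folder if it is missing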
os.chdir(dirf)

res = requests.get(URL)
soup = BeautifulSoup(res.text, 'lxml')
pdflinks = [itemlink['href'] for itemlink in soup.find_all("a",{"data-linktype":"internal"}) if "reject" not in itemlink['href']]
for pdflink in pdflinks:
    filename = pdflink.split("/")[-1] + ".pdf"
    with open(filename, 'wb') as f:
        f.write(requests.get(pdflink).content)

Upvotes: 0

RoadRunner

Reputation: 26315

As others have pointed out, you need to create basedir beforehand, since the user running the script may not have that directory yet. Make sure the directory-creation code runs at the beginning of the script, before the main logic.

Additionally, hardcoding the base directory might not be a good idea when transferring the script to different systems. It would be preferable to use the user's %USERPROFILE% environment variable:

from os import environ
from os.path import join

basedir = join(environ["USERPROFILE"], "Desktop", "pdf_dot")

This resolves to a path such as C:\Users\blah\Desktop\pdf_dot.

However, the above environment variable only works on Windows. If you want it to work on Linux, you will have to use os.environ["HOME"] instead.

If the script needs to run on both systems, you can branch on os.name:

from os import environ, name
from os.path import join

# Windows
if name == 'nt':
    basedir= join(environ["USERPROFILE"], "Desktop", "pdf_dot")

# Linux and macOS (os.name is 'posix' on both)
elif name == 'posix':
    basedir = join(environ["HOME"], "Desktop", "pdf_dot")
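
As a sketch of an alternative to branching on os.name, pathlib.Path.home() from the standard library resolves the current user's home directory on Windows, Linux, and macOS alike. This assumes the desktop folder is literally named Desktop under the home directory, which may not hold on localized or redirected setups.

```python
from pathlib import Path

# Assumption: the desktop lives at <home>/Desktop; adjust if your system differs.
basedir = Path.home() / "Desktop" / "pdf_dot"
print(basedir)
```

A Path object can be passed directly to open() or os.makedirs(), so the rest of the script would not need to change.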

Upvotes: 1

Nikolaos Paschos

Reputation: 426

I used this exact (indented) code but replaced the basedir with my own directory, and it worked only after I made sure that the path actually exists. This code does not create the folder if it does not exist.
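
To illustrate why the question's script prints "No PDF produced" instead of crashing, here is a minimal sketch that reproduces the behaviour without any network access. The temporary directory, the helper try_write, and the file name example.pdf are placeholders for illustration and are not part of the original script.

```python
import os
import tempfile

# Hypothetical stand-in for basedir: a fresh temp dir has no pdf_dot subfolder.
basedir = tempfile.mkdtemp()
pdffile = os.path.join(basedir, "pdf_dot", "example.pdf")

def try_write(path, data):
    """Return True if the file was written, False if its folder is missing."""
    try:
        with open(path, "wb") as f:
            f.write(data)
        return True
    except FileNotFoundError:
        # open() in 'wb' mode creates the file, but never its parent folders.
        return False

before = try_write(pdffile, b"dummy bytes")  # pdf_dot does not exist yet
os.makedirs(os.path.dirname(pdffile), exist_ok=True)
after = try_write(pdffile, b"dummy bytes")   # pdf_dot exists now
print(before, after)  # prints: False True
```

The except clause in the question swallows the FileNotFoundError in the same way, which is why no traceback appears even though every write fails.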

Upvotes: 4

andrea-f

Reputation: 1055

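Your code works; I just tested it. You need to make sure the output directory exists. Since the script writes into the pdf_dot folder under basedir, add this to your code before calling _getPdfs_:

import os

if not os.path.exists(basedir + "/pdf_dot"):
    os.makedirs(basedir + "/pdf_dot")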
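
On Python 3.2 and later, the check-then-create pair can also be collapsed into a single os.makedirs call with exist_ok=True. The sketch below shows one way to fold that into the write step. The helper name save_pdf and its parameters are hypothetical, and dummy bytes written to a temporary folder stand in for a real download.

```python
import os
import tempfile

def save_pdf(content, name, outdir):
    """Write bytes to outdir/name.pdf, creating outdir first if needed."""
    os.makedirs(outdir, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(outdir, name + ".pdf")
    with open(path, "wb") as f:
        f.write(content)
    return path

# Demo with a temporary folder and dummy bytes instead of a real download.
outdir = os.path.join(tempfile.mkdtemp(), "pdf_dot")
first = save_pdf(b"dummy", "aqc-088a-035-20360", outdir)
second = save_pdf(b"dummy", "aqc-r100-258-21125", outdir)  # folder exists now
print(os.path.basename(first), os.path.basename(second))
```

In the question's script, respdf.content would take the place of the dummy bytes.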

Upvotes: 5
