user8270077

Reputation: 5071

Downloading all PDF files from a URL using Python

I need a way to download all of the PDF files linked from a given URL, and I found a script that supposedly accomplishes this task (I have not tested it):

import urlparse
import urllib2
import os

from bs4 import BeautifulSoup

url = "https://...."
download_path = "."  # directory to save the PDFs into

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0"}

i = 0

request = urllib2.Request(url, None, headers)
html = urllib2.urlopen(request)
soup = BeautifulSoup(html.read())

# resolve every link on the page and download the ones ending in .pdf
for tag in soup.findAll("a", href=True):
    tag["href"] = urlparse.urljoin(url, tag["href"])
    if os.path.splitext(os.path.basename(tag["href"]))[1] == ".pdf":
        current = urllib2.urlopen(tag["href"])
        print("\n[*] Downloading: %s" % os.path.basename(tag["href"]))
        f = open(os.path.join(download_path, os.path.basename(tag["href"])), "wb")
        f.write(current.read())
        f.close()
        i += 1

print("\n[*] Downloaded %d files" % i)

raw_input("[+] Press any key to exit ... ")

The problem is that I have Python 3.3 installed, and this script does not run under it; for example, the urllib2 module is not available in Python 3.3.

How can I amend the script to be compatible with Python 3.3?

Upvotes: 0

Views: 3357

Answers (4)

x89

Reputation: 3460

For Python 3, you should use urllib.request instead of urllib2 (see the sketch below). It is also important to first inspect the HTML source of the URL you are trying to parse: for example, some pages expose the og_url property while others do not, and depending on this, the way to extract the PDF links can differ.

There's a quick solution along with a detailed explanation on downloading pdfs here:

https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
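For reference, here is a minimal Python 3 sketch of the same approach as the script in the question, using urllib.request, urllib.parse and BeautifulSoup; the url and download_path values are placeholders you would need to fill in:

import os
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup

url = "https://...."   # placeholder: the page that links to the PDFs
download_path = "."    # placeholder: directory to save the PDFs into
headers = {"User-Agent": "Mozilla/5.0"}

# fetch the page and parse it
req = urllib.request.Request(url, headers=headers)
soup = BeautifulSoup(urllib.request.urlopen(req).read(), "html.parser")

count = 0
for tag in soup.find_all("a", href=True):
    # resolve relative links against the page URL
    href = urllib.parse.urljoin(url, tag["href"])
    if os.path.splitext(os.path.basename(href))[1].lower() == ".pdf":
        print("[*] Downloading: %s" % os.path.basename(href))
        pdf = urllib.request.urlopen(urllib.request.Request(href, headers=headers))
        with open(os.path.join(download_path, os.path.basename(href)), "wb") as f:
            f.write(pdf.read())
        count += 1

print("[*] Downloaded %d files" % count)
input("[+] Press any key to exit ... ")

The main changes from the Python 2 version are that urllib2.Request/urlopen become urllib.request.Request/urlopen, urlparse.urljoin becomes urllib.parse.urljoin, and raw_input becomes input.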

Upvotes: -1

13013SwagR

Reputation: 611

Why not a single line of bash:

wget -r -l1 -A.pdf http://www.example.com/page-with-pdfs.htm

Upvotes: 1

John S

Reputation: 71

As somebody pointed out, a shell script may be a much better way to accomplish your goals.

However, if you are set on using Python to do this, you could keep your Python 3.3 environment intact and install what is called a "virtual environment". Inside the virtual environment you can have whatever Python version and libraries you want, and it will not interfere with your current Python installation.

There is a good tutorial here for getting started with a virtual environment.

Upvotes: 0

Gilles Quénot

Reputation: 184995

Why not a 3-line script requiring just one module?

mech-dump --links http://domain.tld/path |
grep -i '\.pdf$' |
xargs -n1 wget

The mech-dump command is provided by the package libwww-mechanize-perl on Debian and derivatives.

Upvotes: 1
