Reputation: 5071
I need to find a way to download all the PDF files found at a given URL, and I found a script that supposedly accomplishes this task (I have not tested it):
import urllib2
import urlparse
import os
from bs4 import BeautifulSoup

url = "https://...."
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0"}
i = 0

request = urllib2.Request(url, None, headers)
html = urllib2.urlopen(request)
soup = BeautifulSoup(html.read())

for tag in soup.findAll("a", href=True):
    tag["href"] = urlparse.urljoin(url, tag["href"])
    if os.path.splitext(os.path.basename(tag["href"]))[1] == ".pdf":
        current = urllib2.urlopen(tag["href"])
        print("\n[*] Downloading: %s" % os.path.basename(tag["href"]))
        # download_path must be defined before this point
        f = open(download_path + "\\" + os.path.basename(tag["href"]), "wb")
        f.write(current.read())
        f.close()
        i += 1

print("\n[*] Downloaded %d files" % i)
raw_input("[+] Press any key to exit ... ")
The problem is that I have Python 3.3 installed, and this script does not run with Python 3.3; urllib2, for example, is not available for Python 3.3.
How can I amend the script to be compatible with Python 3.3?
Upvotes: 0
Views: 3357
Reputation: 3460
For Python 3, you should use import urllib.request instead of urllib2. It's important to first inspect the HTML source of the URL you're trying to parse; for example, some pages have the og_url property while others do not, and depending on that, the way to extract the PDF links can differ. There's a quick solution, along with a detailed explanation on downloading PDFs, here:
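In the meantime, a minimal sketch of such a port of the script above, assuming BeautifulSoup 4 with the built-in html.parser and, as a stand-in, a download_path of the current directory:

import os
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

url = "https://...."
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0"}
download_path = "."  # assumption: save the PDFs to the current directory
i = 0

# urllib2.Request and urllib2.urlopen live in urllib.request in Python 3
request = urllib.request.Request(url, None, headers)
html = urllib.request.urlopen(request)
soup = BeautifulSoup(html.read(), "html.parser")

for tag in soup.find_all("a", href=True):
    # urlparse.urljoin became urllib.parse.urljoin
    href = urllib.parse.urljoin(url, tag["href"])
    if os.path.splitext(os.path.basename(href))[1] == ".pdf":
        print("\n[*] Downloading: %s" % os.path.basename(href))
        with open(os.path.join(download_path, os.path.basename(href)), "wb") as f:
            f.write(urllib.request.urlopen(href).read())
        i += 1

print("\n[*] Downloaded %d files" % i)
input("[+] Press any key to exit ... ")  # raw_input was renamed to input in Python 3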
Upvotes: -1
Reputation: 611
Why not a single line of bash?

wget -r -l1 -A.pdf http://www.example.com/page-with-pdfs.htm
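Here, -r recurses from the starting page, -l1 limits the recursion to one level, and -A.pdf keeps only files whose names end in .pdf.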
Upvotes: 1
Reputation: 71
As somebody pointed out, a shell script may be a much better way to accomplish your goals.
However, if you are set on using Python to do this, you could keep your Python 3.3 installation intact and create what is called a "virtual environment". A virtual environment can be built on any Python interpreter installed on your system (here, a Python 2 one, since the script is written for Python 2) with whatever libraries you want, and it will not interfere with your existing Python installation.
There is a good tutorial here for getting started with a virtual environment.
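As a minimal sketch, assuming a Python 2.7 interpreter is installed alongside Python 3.3 and the script is saved as download_pdfs.py (the environment and file names here are hypothetical):

virtualenv -p python2.7 pdfenv   # build the environment on the Python 2 interpreter
. pdfenv/bin/activate            # activate it for the current shell session
pip install beautifulsoup4      # the script's only third-party dependency
python download_pdfs.py         # run the unmodified Python 2 script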
Upvotes: 0
Reputation: 184995
Why not a three-line shell script requiring just one Perl module?
mech-dump --links http://domain.tld/path |
grep -i '\.pdf$' |
xargs -n1 wget
The mech-dump command comes with the libwww-mechanize-perl package on Debian and derivatives.
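If needed, it can be installed with, for example:

sudo apt-get install libwww-mechanize-perl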
Upvotes: 1