Reputation: 2238
I have a set of links to pdf files:
https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf
Some of them are restricted, meaning I won't be able to access the pdf file, while others will go directly to the pdf file itself, like the link above.
I'm currently using the requests package (Python) to access the files, but there are far too many files for me to download them all, and I also don't want to keep the files as PDFs.
What I would like to do is go to each link, check if the link is a pdf file, download that file (if necessary), turn it into a txt file, and delete the original pdf file.
I have a shell script that is a very good pdf to txt converter, but is it possible to run a shell script from python?
Upvotes: 0
Views: 304
Reputation: 87074
Kieran Bristow has answered part of your question about how to run an external program from Python.
The other part of your question is about selectively downloading documents by checking whether the resource is a PDF document. Unless the remote server offers alternate representations of its documents (e.g. a text version), you will need to download the documents. To avoid downloading non-PDF documents, you can send an initial HEAD request and look at the reply headers to determine the content-type, like this:
import os.path
import requests

session = requests.session()
for url in [
        'https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf',
        'https://www.duo.uio.no/bitstream/10852abcd/90121023/1234/oppgave-2003-10-30.pdf']:
    try:
        # HEAD returns only the headers, so nothing is downloaded yet.
        resp = session.head(url, allow_redirects=True)
        resp.raise_for_status()
        # The header may be missing or carry parameters (e.g. "; charset=..."),
        # so compare only the leading media type.
        if resp.headers.get('content-type', '').startswith('application/pdf'):
            resp = session.get(url)
            if resp.ok:
                with open(os.path.basename(url), 'wb') as outfile:
                    outfile.write(resp.content)
                print("Saved {} to file {}".format(url, os.path.basename(url)))
            else:
                print('GET request for URL {} failed with HTTP status "{} {}"'.format(url, resp.status_code, resp.reason))
    except requests.HTTPError as exc:
        print("HEAD failed for URL {} : {}".format(url, exc))
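The content-type check can be made a little more tolerant. The following is only a sketch, written for Python 3: the helper name `looks_like_pdf` is made up for illustration, and it assumes you are willing to fall back on the file's leading bytes (PDF files start with `%PDF-`) when the server reports a generic type such as `application/octet-stream`.

```python
def looks_like_pdf(content_type, first_bytes=b""):
    """Return True if the header value or the leading bytes indicate a PDF.

    Hypothetical helper: not part of requests or of the code above.
    """
    # Drop parameters such as "; charset=binary" and normalise the case.
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    if media_type == "application/pdf":
        return True
    # Fall back on the marker that PDF files begin with.
    return first_bytes.startswith(b"%PDF-")
```

In the loop above it could be called as `looks_like_pdf(resp.headers.get('content-type'))`. The `first_bytes` argument is only useful if you have already fetched the start of the body, which costs a GET request.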
Upvotes: 2
Reputation: 338
Yes! It is entirely possible to run shell scripts from Python. Take a look at the subprocess module, which lets you start external processes much as you would from a shell: https://docs.python.org/2/library/subprocess.html
For example:
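import subprocess

# Run "echo message" and capture its standard output through a pipe.
process = subprocess.Popen(["echo", "message"], stdout=subprocess.PIPE)
# communicate() waits for the process to exit and returns (stdout, stderr).
print(process.communicate())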
There are many tutorials out there, e.g. http://www.bogotobogo.com/python/python_subprocess_module.php
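Applied to the question, the same module can run the converter on each downloaded file and then remove the PDF. This is a minimal sketch for Python 3, not a definitive implementation: the function name `convert_and_delete` and the script name `pdf2txt.sh` are placeholders, and it assumes the converter takes the input path and the output path as its two arguments.

```python
import os
import subprocess


def convert_and_delete(pdf_path, converter_cmd):
    """Run converter_cmd on pdf_path, then delete the PDF if that succeeded.

    converter_cmd is a list such as ["sh", "pdf2txt.sh"] (hypothetical name);
    the input and output paths are appended to it as arguments (an assumption
    about how the script is invoked).
    """
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    # check=True raises CalledProcessError on a non-zero exit status,
    # so the PDF is only removed after a successful conversion.
    subprocess.run(converter_cmd + [pdf_path, txt_path], check=True)
    os.remove(pdf_path)
    return txt_path


# Example call (assumes pdf2txt.sh exists in the current directory):
# convert_and_delete("oppgave-2003-10-30.pdf", ["sh", "pdf2txt.sh"])
```

Passing the command as a list avoids shell quoting problems with unusual file names. If your script reads or writes its files differently, only the argument list needs to change.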
Upvotes: 2