Pdf to txt from http request

Question

I have a set of links to pdf files:

https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf

Some of them are restricted, meaning I won't be able to access the pdf file, while others will go directly to the pdf file itself, like the link above.

I'm currently using the requests package (Python) to access the files, but there are far too many files for me to download, and I also don't want the files in PDF.

What I would like to do is go to each link, check if the link is a pdf file, download that file (if necessary), turn it into a txt file, and delete the original pdf file.

I have a shell script that is a very good pdf to txt converter, but is it possible to run a shell script from python?

mhawke · Accepted Answer

Kieran Bristow has answered part of your question about how to run an external program from Python.

The other part of your question is about selectively downloading documents by checking whether the resource is a PDF document. Unless the remote server offers alternate representations of their documents (e.g. a text version), you will need to download the documents. To avoid downloading non-PDF documents you can send an initial HEAD request and look at the reply headers to determine the content-type like this:

import os.path
import requests

session = requests.session()

for url in [
    'https://www.duo.uio.no/bitstream/10852/9012/1/oppgave-2003-10-30.pdf',
    'https://www.duo.uio.no/bitstream/10852abcd/90121023/1234/oppgave-2003-10-30.pdf']:
    try:
        # HEAD fetches only the headers, so nothing is downloaded yet
        resp = session.head(url, allow_redirects=True)
        resp.raise_for_status()
        # The header may carry parameters, e.g. "application/pdf; charset=..."
        content_type = resp.headers.get('content-type', '')
        if content_type.split(';')[0].strip().lower() == 'application/pdf':
            resp = session.get(url)
            if resp.ok:
                filename = os.path.basename(url)
                with open(filename, 'wb') as outfile:
                    outfile.write(resp.content)
                print("Saved {} to file {}".format(url, filename))
            else:
                print('GET request for URL {} failed with HTTP status "{} {}"'.format(url, resp.status_code, resp.reason))
    except requests.HTTPError as exc:
        print("HEAD failed for URL {} : {}".format(url, exc))
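
For the conversion step, here is a minimal sketch of calling an external converter and then deleting the PDF. It assumes Python 3.5+ (for subprocess.run) and that your script takes the input PDF path and the output text path as its last two arguments. The function name convert_and_remove and the script name ./pdf2txt.sh are hypothetical, so adjust them to match your converter.

```python
import os
import subprocess


def convert_and_remove(pdf_path, converter_cmd):
    """Run an external converter on pdf_path, then delete the PDF.

    converter_cmd is a list such as ['./pdf2txt.sh'] (hypothetical script).
    Assumption: the converter accepts the input PDF path and the output
    text path as its last two arguments.
    """
    txt_path = os.path.splitext(pdf_path)[0] + '.txt'
    # check=True raises CalledProcessError on a non-zero exit status,
    # so the PDF is only deleted after a successful conversion.
    subprocess.run(list(converter_cmd) + [pdf_path, txt_path], check=True)
    os.remove(pdf_path)
    return txt_path
```

You could call it right after a file is saved, e.g. convert_and_remove(os.path.basename(url), ['./pdf2txt.sh']). If the converter fails, the exception propagates and the PDF is left on disk so you can retry it.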
