Luigi

Reputation: 4129

Working with a pdf from the web directly in Python?

I'm trying to use Python to read .pdf files from the web directly rather than save them all to my computer. All I need is the text from the .pdf and I'm going to be reading a lot (~60k) of them, so I'd prefer to not actually have to save them all.

I know how to save a .pdf from the internet using urllib and open it with PyPDF2. (example)

I want to skip the saving-to-file step.

import urllib, PyPDF2
urllib.urlopen('https://bitcoin.org/bitcoin.pdf')
wFile = urllib.urlopen('https://bitcoin.org/bitcoin.pdf')
lFile = PyPDF2.pdf.PdfFileReader(wFile.read())

I get an error that is fairly easy to understand:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    fil = PyPDF2.pdf.PdfFileReader(wFile.read())
  File "C:\Python27\lib\PyPDF2\pdf.py", line 797, in __init__
    self.read(stream)
  File "C:\Python27\lib\PyPDF2\pdf.py", line 1245, in read
    stream.seek(-1, 2)
AttributeError: 'str' object has no attribute 'seek'

Obviously PyPDF2 doesn't like that I'm giving it the urllib.urlopen().read() (which appears to return a string). I know that this string is not the "text" of the .pdf but a string representation of the file. How can I resolve this?

EDIT: NorthCat's solution resolved my error, but when I try to actually extract the text, I get this:

>>> print lFile.getPage(0).extractText()
ˇˆ˘˘˙˘˘˝˘˛˘ˇ˘ˇ˚ˇˇˇ˘ˆ˘˘˘˚ˇˆ˘ˆ˘ˇ˜ˇ˝˚˘˛˘ˇ ˘˘˘ˇ˛˘˚˚ˆˇˇ!
˝˘˚ˇ˘˘˚"˘˘ˇ˘˚ˇ˘˘˚ˇ˘˘˘˙˘˘˘#˘˘˘ˆ˘˛˘˚˛˙ ˘˘˚˚˘˛˙#˘ˇ˘ˇˆ˘˘˛˛˘˘!˘˘˛˘˝˘˘˘˚ ˛˘˘ˇ˘ˇ˛$%&˘ˇ'ˆ˛
$%&˘ˇˇ˘˚ˆ˚˘˘˘˘ ˘ˆ(ˇˇ˘˘˘˘ˇ˘˚˘˘#˘˘˘ˇ˛!ˇ)˘˘˚˘˘˛ ˚˚˘ˇ˘˝˘˚'˘˘ˇˇ ˘˘ˇ˘˛˙˛˛˘˘˚ˇ˘˘ˆ˘˘ˆ˙
$˘˘˘*˘˘˘ˇˆ˘˘ˇˆ˛ˇ˘˝˚˚˘˘ˇ˘ˆ˘"˘ˆ˘ˇˇ˘˛ ˛˛˘˛˘˘˘˘˘˘˛˘˘˚˚˘$ˇ˘ˇˆ˙˘˝˘ˇ˘˘˘ˇˇˆˇ˘ ˘˛ˇ˝˘˚˚#˘˛˘˚˘˘ 
˘ˇ˘˚˛˛˘ˆ˛ˇˇˇ ˚˘˘˚˘˘ˇ˛˘˙˘˝˘ˇ˘ˆ˘˛˙˘˝˘ˇ˘˘˝˘"˘˛˘˝˘ˇ ˘˘˘˚˛˘˚)˘˘ˆ˛˘˘ 
˘˛˘˛˘ˆˇ˚˘˘˘˘˚˘˘˘˘˛˛˚˘˚˝˚ˇ˘#˘˘˚ˆ˘˘˘˝˘˚˘ˆˆˇ˘ˆ 
˘˘˘ˆ˘˝˘˘˚"˘˘˚˘˚˘ˇ˘ˆ˘ˆ˘˚ˆ˛˚˛ˆ˚˘˘˘˘˘˘˚˛˚˚ˆ#˘ˇˇˆˇ˘˝˘˘ˇ˚˘ˇˇ˘˛˛˚ ˚˘˘˘ˇ˚˘˘ˇ˘˘˚ˆ˘*˘ 
˘˘ˇ˘˚ˇ˘˙˘˚ˇ˘˘˘˙˙˘˘˚˚˘˘˝˘˘˘˛˛˘ˇˇ˚˘˛#˘ˆ˘˘ˇ˘˚˘ˇˇ˘˘ˇˆˇ˘$%&˘ˆ˘˛˘˚˘,

Upvotes: 3

Views: 7372

Answers (2)

pyjavo

Reputation: 1613

I know this question is old, but I had the same issue and here is how I solved it. The newer PyPDF2 docs have a section about streaming data.

The example there looks like this:

from io import BytesIO
from PyPDF2 import PdfReader

# Prepare example
with open("example.pdf", "rb") as fh:
    bytes_stream = BytesIO(fh.read())

# Read from bytes_stream
reader = PdfReader(bytes_stream)

Therefore, what I did instead was this:

import urllib.request
from io import BytesIO
from PyPDF2 import PdfReader

NEW_PATH = 'https://example.com/path/to/pdf/online?id=123456789&date=2022060'

wFile = urllib.request.urlopen(NEW_PATH)
bytes_stream = BytesIO(wFile.read())

reader = PdfReader(bytes_stream)
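
To actually get the text out, something like the following might work. It is only a sketch for Python 3 and a PyPDF2 version that provides PdfReader; the helper names fetch_pdf_stream and extract_text_from_url, and the "%PDF-" sanity check, are my own additions and not part of PyPDF2.

```python
import urllib.request
from io import BytesIO


def fetch_pdf_stream(url, opener=urllib.request.urlopen):
    """Download a PDF and return it as a seekable in-memory stream.

    `opener` is a hypothetical hook so the download can be swapped out
    (for example in tests); by default it is urllib.request.urlopen.
    """
    with opener(url) as response:
        data = response.read()
    # Assumption: a well-formed PDF begins with the "%PDF-" marker, so this
    # catches HTML error pages early. Drop the check if your files have
    # leading bytes before the header.
    if not data.startswith(b"%PDF-"):
        raise ValueError("response from %s does not look like a PDF" % url)
    return BytesIO(data)


def extract_text_from_url(url):
    """Return the text of every page, joined by newlines."""
    # Imported here so fetch_pdf_stream stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(fetch_pdf_stream(url))
    # extract_text() may return an empty string for pages without text.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Nothing is written to disk here: the whole file lives in the BytesIO object, which has the seek() method that a plain string lacks. Whether the extracted text is readable still depends on the PDF itself, as the garbled output in the question shows.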

Upvotes: 2

NorthCat

Reputation: 9937

Try wrapping the downloaded string in a file-like object, which gives PyPDF2 the seek() method it needs:

import urllib, PyPDF2
import cStringIO

wFile = urllib.urlopen('https://bitcoin.org/bitcoin.pdf')
lFile = PyPDF2.pdf.PdfFileReader( cStringIO.StringIO(wFile.read()) )

Because PyPDF2 does not extract the text of this file correctly, here are a couple of other solutions. Both, however, require saving the file to disk.

Solution 1: You can use ps2ascii (if you are on Linux or Mac) or xpdf (Windows). Example of using xpdf:

import os
os.system('C:\\xpdfbin-win-3.03\\bin32\\pdftotext.exe C:\\xpdfbin-win-3.03\\bin32\\bitcoin.pdf bitcoin1.txt')

or

import subprocess
subprocess.call(['C:\\xpdfbin-win-3.03\\bin32\\pdftotext.exe',  'C:\\xpdfbin-win-3.03\\bin32\\bitcoin.pdf', 'bitcoin2.txt'])
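
If the goal is only to avoid keeping 60k files around, the download could go through a temporary file that is deleted right after conversion. This is a rough Python 3 sketch, assuming a pdftotext executable is on the PATH and that it accepts "-" as the output name (write to stdout) and the -enc option; the function names build_pdftotext_command and pdf_bytes_to_text are made up for illustration.

```python
import os
import subprocess
import tempfile


def build_pdftotext_command(pdf_path, exe="pdftotext"):
    """Build the argument list; "-" asks pdftotext to write to stdout."""
    return [exe, "-enc", "UTF-8", pdf_path, "-"]


def pdf_bytes_to_text(data, exe="pdftotext"):
    """Write `data` to a temporary file, convert it, then delete the file."""
    # delete=False so the file can be closed before pdftotext opens it;
    # a NamedTemporaryFile that is still open cannot be reopened on Windows.
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(data)
        tmp_path = tmp.name
    try:
        result = subprocess.run(
            build_pdftotext_command(tmp_path, exe),
            stdout=subprocess.PIPE,
            check=True,  # raise CalledProcessError if the conversion fails
        )
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        # Remove the temporary file whether or not the conversion succeeded.
        os.remove(tmp_path)
```

Here `data` would be the bytes returned by reading the URL response, and `exe` can be set to a full path such as the xpdf one above.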

Solution 2: You can use one of the online PDF-to-text converters. Example of using pdf.my-addr.com:

import MultipartPostHandler
import urllib2


def pdf2text( absolute_path ):
    url = 'http://pdf.my-addr.com/pdf-to-text-converter-tool.php'

    params = {  'file' : open( absolute_path, 'rb' ),
                'encoding': 'UTF-8',
    }
    opener = urllib2.build_opener( MultipartPostHandler.MultipartPostHandler )
    return opener.open( url, params ).read()

print pdf2text('bitcoin.pdf')

You can find the code of MultipartPostHandler here. I tried to use cStringIO instead of open(), but it did not work. Maybe it will be helpful for you.

Upvotes: 1
