rahlf23

Reputation: 9019

Python - Extracting text from webpage PDF

I have come across a few posts about converting PDFs to HTML or to text, but they all work from a file saved on the computer. Is there a way to extract the text from a PDF hosted on a webpage without saving the PDF file to disk? I will be doing this for a large number of files by iterating through a list of URLs.

I am also curious which library is best suited to this: pdfkit, pdf2txt, pdfminer, etc.?

Here is an example website with the format I will be dealing with: http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf

Upvotes: 6

Views: 13570

Answers (3)

Ankesh

Reputation: 21

Just a minor update to the above answer:

import PyPDF2
import requests
import io


url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

response = requests.get(url)
f = io.BytesIO(response.content)
reader = PyPDF2.PdfReader(f)
pages = reader.pages
# Concatenate the extracted text of every page
text = "".join(page.extract_text() for page in pages)

Upvotes: 1

Andriy125

Reputation: 21

Updated code for the PyPDF2 library:

import io
import requests
import PyPDF2

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

r = requests.get(url)
f = io.BytesIO(r.content)

reader = PyPDF2.PdfReader(f)
# pages is zero-indexed, so pages[2] is the third page
contents = reader.pages[2].extract_text().split('\n')

Upvotes: 1

Dror Av.

Reputation: 1214

You can download the file as a byte stream with requests and wrap it in io.BytesIO(), like so:

import io

import requests
from pyPdf import PdfFileReader

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

r = requests.get(url)
f = io.BytesIO(r.content)

reader = PdfFileReader(f)
contents = reader.getPage(0).extractText().split('\n')

f is a file-like object that you can use just as if you had opened a PDF file. This way the file exists only in memory and is never saved locally.

To get the text out of the PDF file, you can use pyPdf.
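Since the question mentions iterating through a list of URLs, the same in-memory approach could be wrapped in a loop. The following is only a rough sketch, not a tested recipe for this site: the helper names (fetch_bytes, looks_like_pdf, extract_text, texts_from_urls) are made up for illustration, it assumes a PyPDF2 version that provides PdfReader, and it uses urllib from the standard library for the download (requests would work just as well).

```python
import io
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download the body of `url` into memory; nothing is written to disk."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def looks_like_pdf(data):
    """Lenient check: a PDF carries the marker b'%PDF-' near its start."""
    return b"%PDF-" in data[:1024]


def extract_text(data):
    """Return the text of every page, one page per line group."""
    # Assumption: PyPDF2 with the PdfReader API is installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(io.BytesIO(data))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def texts_from_urls(urls, fetch=fetch_bytes, extract=extract_text):
    """Map each URL to its extracted text, or to None if it could not be read."""
    results = {}
    for url in urls:
        try:
            data = fetch(url)
            if not looks_like_pdf(data):
                raise ValueError("response does not look like a PDF")
            results[url] = extract(data)
        except Exception as exc:
            # One bad link should not stop the whole batch.
            print(f"skipped {url}: {exc}")
            results[url] = None
    return results
```

Calling texts_from_urls(list_of_urls) would then return a dict keyed by URL. The fetch and extract parameters exist only so that the download or parsing step can be swapped out, for example to use requests or a different PDF library.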

Upvotes: 8
