Reputation: 57
The PDF files contain actual text, not scanned images. PDFMiner does not support Python 3; are there any other solutions?
Upvotes: 2
Views: 3059
Reputation: 1115
tika worked the best for me. I would even say it's better than PyPDF2 and pdfminer. It made it really easy to extract each line in the PDF into a list. You can install it with pip install tika
and then use the code below:
from tika import parser
rawText = parser.from_file(path_to_pdf)
rawList = rawText['content'].splitlines()
print(rawList)
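A slightly more defensive variant of the snippet above, as a rough sketch. It assumes that the 'content' entry can be None when no text is found in the file, and the names content_to_lines and pdf_to_lines are made up for illustration, not part of tika.

```python
def content_to_lines(parsed):
    """Return the non-empty, stripped lines of a parsed-result dict."""
    # Assumption: 'content' may be missing or None if no text was extracted.
    content = parsed.get("content") or ""
    return [line.strip() for line in content.splitlines() if line.strip()]


def pdf_to_lines(path_to_pdf):
    """Parse a PDF with tika and return its text as a list of lines."""
    # Imported lazily so content_to_lines can be used without tika installed.
    from tika import parser
    return content_to_lines(parser.from_file(path_to_pdf))
```

Calling pdf_to_lines(path_to_pdf) should then give the same kind of list as rawList, minus the blank entries.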
Upvotes: 0
Reputation: 31
For Python 3, you can install pdfminer with:
python -m pip install pdfminer.six
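Once pdfminer.six is installed, something like the following should work for plain text extraction. This is only a sketch: it relies on extract_text from pdfminer.high_level, which takes zero-based page numbers, and the helper names to_page_indexes and extract_pages are invented here for illustration.

```python
def to_page_indexes(pages):
    """Convert 1-based page numbers to sorted, unique 0-based indexes."""
    return sorted({p - 1 for p in pages if p >= 1})


def extract_pages(path_to_pdf, pages=None):
    """Return the text of the given 1-based pages, or of all pages if None."""
    # Imported lazily so this file loads even without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    page_numbers = to_page_indexes(pages) if pages else None
    return extract_text(path_to_pdf, page_numbers=page_numbers)
```

For example, extract_pages("chapter1.pdf", pages=[1, 2]) would ask for the first two pages of a local file with that (hypothetical) name.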
Upvotes: 1
Reputation: 31
There is also the pdfminer2 fork, which supports Python 3.4 and is available through pip3. https://github.com/metachris/pdfminer
This thread helped me patch something together.
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO
def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

if __name__ == "__main__":
    # scrape = open("../warandpeace/chapter1.pdf", 'rb')  # for local files
    scrape = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")  # for external files
    pdfFile = BytesIO(scrape.read())
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()
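If you need the text page by page rather than as one string, a small helper like this might do. It is a sketch that assumes TextConverter writes a form feed character ("\f") after each page, which is my understanding of pdfminer's behavior; split_pages is a name I made up.

```python
def split_pages(text):
    """Split extracted text into per-page strings on form feed characters."""
    pages = text.split("\f")
    # A trailing form feed leaves an empty final chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages
```

With the code above, split_pages(outputString) would return a list with one entry per page, under that assumption.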
Upvotes: 3