Reputation: 71
I used the following code to read a PDF file, but it does not return any text. What could be the reason?
from PyPDF2 import PdfFileReader
reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)
The output is [u''] instead of the page's content.
Upvotes: 6
Views: 20205
Reputation: 383
import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    # Split into lines; iterating over the string itself would yield single characters
    for line in text_lower.splitlines():
        if re.search("abc", line):
            print(line)
I use this to go through a PDF page by page, search each page for key terms, and process the matches further.
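The page-by-page search described here can be sketched independently of the PDF library, working on text that has already been extracted. This is only a minimal sketch: the function name find_matching_lines and the sample strings are hypothetical, and the strings stand in for whatever page.extractText() returns for each page.

```python
import re

def find_matching_lines(page_texts, pattern):
    """Return (page_number, line) pairs for lines that match pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        # splitlines() gives whole lines; looping over a str gives single characters
        for line in text.splitlines():
            if regex.search(line):
                matches.append((page_number, line.strip()))
    return matches

# Hypothetical sample standing in for one extracted string per PDF page
sample_pages = ["Intro\nThe ABC method", "No match here", "abc again\nend"]
print(find_matching_lines(sample_pages, "abc"))
```

Keeping the page number next to each match makes it easier to go back to the right page for further processing.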
Upvotes: 5
Reputation: 136187
The issue was one of two things: (1) The text was not on page one - hence a user error. (2) PyPDF2 failed to extract the text - hence a bug in PyPDF2.
Sadly, the second one still happens for some PDFs.
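To tell the two cases apart, it can help to look at the extracted text of every page rather than only the first one. The sketch below assumes you have already collected one string per page (for example from page.extractText()); the names pages_with_text and diagnose are hypothetical helpers, not part of PyPDF2.

```python
def pages_with_text(page_texts):
    """Return the 1-based numbers of pages whose extracted text is not blank."""
    return [n for n, text in enumerate(page_texts, start=1) if text and text.strip()]

def diagnose(page_texts):
    """Give a rough hint about why the first page came back empty."""
    found = pages_with_text(page_texts)
    if not found:
        # Nothing anywhere: extraction may have failed, or the PDF may hold only images
        return "no text extracted from any page"
    if found[0] != 1:
        return "page 1 is empty; text starts on page %d" % found[0]
    return "page 1 has text"

# Hypothetical sample: an empty cover page followed by real content
print(diagnose(["", "  \n", "Chapter 1"]))
```

If no page yields any text, trying a different extraction library is a reasonable next step.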
Upvotes: 0
Reputation: 2503
def getTextPDF(pdfFileName, password=''):
    """Extract text from a PDF and return it as a list of sentences."""
    from PyPDF2 import PdfFileReader
    from nltk import sent_tokenize
    # Open in binary mode and close the file once all pages are read
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        text = []
        for i in range(0, read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Join the pages and drop line breaks before splitting into sentences
    text = ''.join(text).replace("\n", '')
    text = sent_tokenize(text)
    return text
Upvotes: 0
Reputation: 89
To read files from multiple folders under a directory, the code below can be used. This example reads PDF files:
import os
from tika import parser

path = "/usr/local/"  # root directory to search
for r, d, f in os.walk(path):  # walk through all subdirectories
    for file in f:
        if file.lower().endswith(".pdf"):  # read only PDF files
            file_join = os.path.join(r, file)  # full path to the file
            file_data = parser.from_file(file_join)  # parse the PDF file
            text = file_data['content']  # extracted text
            print(text)  # print the content
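The directory walk can also be separated from the parsing step, which makes it easy to check which files will be picked up before any PDF is opened. This is a sketch using only the standard library; iter_pdf_paths is a hypothetical helper name, and the demo builds a throwaway directory so that it runs anywhere.

```python
import os
import tempfile

def iter_pdf_paths(root):
    """Yield full paths of files under root whose names end in .pdf (any case)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # endswith avoids false hits such as "notes.pdf.txt"
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)

# Demo on a temporary directory with two PDFs and one non-PDF file
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for rel in ("a.pdf", "notes.pdf.txt", os.path.join("sub", "B.PDF")):
        open(os.path.join(tmp, rel), "w").close()
    found = [os.path.relpath(p, tmp) for p in iter_pdf_paths(tmp)]
print(found)
```

Each yielded path can then be handed to whichever parser you use.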
Upvotes: 0
Reputation: 426
I think you need to specify the drive letter, which is missing from your path. For example: "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried it and could read the file without any problem.
Or, if you want to locate the file with the os module (which your code does not actually tie to the directory), you can try the following:
from PyPDF2 import PdfFileReader
import os

def find(name, path):
    # Return the full path of the first file called `name` under `path`, or None
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

directory = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')
with open(directory, 'rb') as f:
    reader = PdfFileReader(f)
    contents = reader.getPage(0).extractText().split('\n')
print(contents)
The find function comes from Nadia Alramli's answer to "Find a file in python".
Upvotes: 0
Reputation: 942
Hello Rahul Pipalia,
If PyPDF2 is not installed in your Python environment, install it first and then use the module.
In a terminal:
sudo apt-get install python-pypdf
Then try the code below:
# Import the library
import PyPDF2

# Name of the file to read, including the ".pdf" extension; open it in binary mode
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Index of the page to read; pages are numbered from 0, so 0 is the first page
Page_Number_of_the_PDF_file = 0
page = read_pdf.getPage(Page_Number_of_the_PDF_file)
page_content = page.extractText()

# Display the content of the page
print(page_content)
pdf_file.close()
You can download a sample PDF from the link below and try the code on it: https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1
I hope this answer is helpful.
If you have any questions, please leave a comment.
Upvotes: -2
Reputation: 583
Maybe this can help you read the PDF.
import pyPdf

def getPDFContent(path):
    content = ""
    pages = 10  # read at most the first 10 pages
    p = open(path, "rb")
    pdf_content = pyPdf.PdfFileReader(p)
    # Do not request more pages than the file has
    for i in range(0, min(pages, pdf_content.getNumPages())):
        content += pdf_content.getPage(i).extractText() + "\n"
    p.close()
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
Upvotes: 0