Rahul Pipalia

Reputation: 71

Reading pdf files line by line using python

I used the following code to read a PDF file, but it does not return any text. What could be the reason?

from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)

The output is [u''] instead of the content of the file.

Upvotes: 6

Views: 20205

Answers (7)

Piyush Rumao

Reputation: 383

import re
from PyPDF2 import PdfFileReader

reader = PdfFileReader("example.pdf")

for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    # Split into lines; iterating over the string itself yields single characters
    for line in text_lower.splitlines():
        if re.search("abc", line):
            print(line)

I use this to iterate through a PDF page by page, search each page for key terms, and process the matches further.
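For the "process further" step, it can help to record which page each match came from. A minimal sketch, assuming the page texts have already been extracted into a list of strings; find_matching_lines is a hypothetical helper name, not part of PyPDF2.

```python
import re


def find_matching_lines(page_texts, pattern):
    """Return (page_number, line) pairs for lines matching pattern.

    page_texts is any iterable of strings, one per page.
    Matching is case-insensitive and page numbers start at 1.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        # Split each page into lines before searching
        for line in text.splitlines():
            if regex.search(line):
                matches.append((page_number, line.strip()))
    return matches
```

The list of page texts could be built from the loop above, for example by collecting page.extractText() for every page of the reader.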

Upvotes: 5

Martin Thoma

Reputation: 136187

The issue was one of two things: (1) The text was not on page one - hence a user error. (2) PyPDF2 failed to extract the text - hence a bug in PyPDF2.

Sadly, the second one still happens for some PDFs.
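To tell the two cases apart, one option is to extract every page and see which ones yield any text. A minimal sketch, assuming the same PyPDF2 calls used in the question (newer PyPDF2 and pypdf releases name these PdfReader and extract_text); pages_with_text and extract_page_texts are hypothetical helper names.

```python
def pages_with_text(page_texts):
    """Return the 1-based numbers of pages whose extracted text is not blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if text and text.strip()
    ]


def extract_page_texts(pdf_path):
    """Extract the text of every page in the PDF at pdf_path."""
    # Imported here so pages_with_text stays usable without PyPDF2 installed.
    # Assumption: an older PyPDF2 release that still provides PdfFileReader.
    from PyPDF2 import PdfFileReader

    with open(pdf_path, "rb") as pdf_file:
        reader = PdfFileReader(pdf_file)
        return [page.extractText() for page in reader.pages]
```

If pages_with_text(extract_page_texts("example.pdf")) returns page numbers but not 1, the text simply lives on a later page (case 1). An empty list suggests that extraction failed for the whole file (case 2).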

Upvotes: 0

thrinadhn

Reputation: 2503

def getTextPDF(pdfFileName, password=''):
    """Extract the text of a PDF and return it as a list of sentences."""
    import PyPDF2
    from nltk import sent_tokenize
    # Open in binary mode; the with block closes the file afterwards
    with open(pdfFileName, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        if password != '':
            read_pdf.decrypt(password)
        text = []
        for i in range(0, read_pdf.getNumPages()):
            text.append(read_pdf.getPage(i).extractText())
    # Join the pages, then drop every newline before sentence tokenization
    text = '\n'.join(text).replace("\n", '')
    text = sent_tokenize(text)
    return text

Upvotes: 0

Anush

Reputation: 89

To read files from multiple folders under a directory, the code below can be used. This example reads PDF files:

import os
from tika import parser

directory = "/usr/local/"  # root directory to search
for r, d, f in os.walk(directory):  # walk through all subdirectories
    for file in f:
        if file.lower().endswith(".pdf"):  # read only PDF files
            file_join = os.path.join(r, file)  # build the full path
            file_data = parser.from_file(file_join)  # parse the PDF file
            text = file_data['content']  # read the content
            print(text)  # print the content

Upvotes: 0

Ahaha

Reputation: 426

I think you need to specify the drive letter; it is missing from your path. For example, "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried this and could read the file without any problem.

Or, if you want to locate the file with the os module, which your code did not really tie to your directory, you can try the following:

from PyPDF2 import PdfFileReader
import os

def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)

file_path = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')

# Open in binary mode; the with block closes the file afterwards
with open(file_path, 'rb') as f:
    reader = PdfFileReader(f)
    contents = reader.getPage(0).extractText().split('\n')

print(contents)

The find function comes from Nadia Alramli's answer to the question "Find a file in python".
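Since the find function returns None when nothing matches, it may be worth confirming that a path points to an existing file before opening it. A minimal sketch; resolve_pdf_path is a hypothetical helper name, and raising FileNotFoundError assumes Python 3.

```python
import os


def resolve_pdf_path(path):
    """Return the absolute form of path, or raise if no such file exists."""
    if path is None:
        raise FileNotFoundError("No path given (file was not found)")
    full_path = os.path.abspath(path)
    # isfile is False for directories and for paths that do not exist
    if not os.path.isfile(full_path):
        raise FileNotFoundError("No such file: " + full_path)
    return full_path
```

Passing the result of find through this helper turns a missing file into an explicit error message instead of a failure inside open.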

Upvotes: 0

Mayur Vora

Reputation: 942

Hello Rahul Pipalia,

If PyPDF2 is not installed in your Python environment, install it first and then use the module.

Installation Steps for Ubuntu (Install python-pypdf)

  1. First, open a terminal
  2. Then type sudo apt-get install python-pypdf

Solution to Your Problem

Try the code below:

# Import library
import PyPDF2

# Name of the file to read, including the ".pdf" extension (opened in binary mode)
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

# Zero-based index of the page to read; must be less than number_of_pages
page_number = 0
page = read_pdf.getPage(page_number)

page_content = page.extractText()

# Display the content of the page
print(page_content)

pdf_file.close()

Download the PDF from the link below and try this code: https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1

I hope my answer is helpful.
If you have any questions, please leave a comment.

Upvotes: -2

Tejas Thakar

Reputation: 583

Maybe this can help you read the PDF.

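import pyPdf

def getPDFContent(path):
    content = ""
    # Open in binary mode; the with block closes the file afterwards
    with open(path, "rb") as p:
        pdf_content = pyPdf.PdfFileReader(p)
        # Visit every page in the document
        for i in range(0, pdf_content.getNumPages()):
            content += pdf_content.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content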

Upvotes: 0
