Reputation: 1117
I'm opening a lot of PDFs and I want to delete them after they have been parsed, but the files remain open until the program finishes running. How do I close the PDFs I open using PyPDF2?
Code:
def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = PyPDF2.PdfFileReader(file(path, "rb"))
    # Check for number of pages, prevents out of bounds errors
    max = 0
    if pdf.numPages > 3:
        max = 3
    else:
        max = (pdf.numPages - 1)
    # Iterate pages
    for i in range(0, max):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    #pdf.close()
    return content
Upvotes: 8
Views: 12532
Reputation: 5549
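Just open and close the file yourself:
f = open(path, "rb")
pdf = PyPDF2.PdfFileReader(f)
# ... extract what you need from pdf here ...
f.close()
PyPDF2 .read()s the stream that you pass in, right in the constructor, so it never owns the file and you decide when it gets closed. Page data may still be fetched from the stream on demand, so do your extraction before calling f.close(); after that you can just toss the file.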
A context manager will work, too:
with open(path, "rb") as f:
    pdf = PyPDF2.PdfFileReader(f)
    do_other_stuff_with_pdf(pdf)
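If the file needs to be closed (and deleted) before you are done with the reader, one way to be safe is to copy its bytes into memory first. A minimal sketch using only the standard library; read_into_memory is a hypothetical helper name, not a PyPDF2 function:

```python
import io


def read_into_memory(path):
    """Return an in-memory, seekable copy of the file at `path`."""
    with open(path, "rb") as f:
        data = f.read()
    # The on-disk handle is closed here; the buffer stays usable
    # even if the original file is deleted afterwards.
    return io.BytesIO(data)
```

The returned buffer supports read and seek, so it should be usable wherever a binary file object is expected, e.g. PyPDF2.PdfFileReader(read_into_memory(path)). That pairing is an assumption here and has not been checked against a particular PyPDF2 version. The trade-off is that the whole file is held in memory.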
Upvotes: 11
Reputation: 3134
Yes, you are passing the stream in to PdfFileReader, so you can close it yourself. The with statement is the preferable way to do that, since it closes the file for you:
def getPDFContent(path):
    with open(path, "rb") as f:
        content = ""
        # Load PDF into pyPDF
        pdf = PyPDF2.PdfFileReader(f)
        # Read at most the first 3 pages; min() also prevents out of bounds errors
        max_pages = min(pdf.numPages, 3)
        # Iterate pages
        for i in range(0, max_pages):
            # Extract text from page and add to content
            content += pdf.getPage(i).extractText() + "\n"
    # The file is closed here; collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
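The page-limit and whitespace steps don't depend on PyPDF2, so they can be pulled out and checked without a PDF at hand. A minimal sketch; first_page_indices and collapse_whitespace are hypothetical helper names, not part of PyPDF2:

```python
def first_page_indices(num_pages, limit=3):
    """Indices of the first `limit` pages, or of all pages if there are fewer."""
    # min() keeps the range inside the document, so no index is out of bounds.
    return range(min(num_pages, limit))


def collapse_whitespace(text):
    """Replace non-breaking spaces and squeeze whitespace runs to single spaces."""
    return " ".join(text.replace(u"\xa0", " ").strip().split())
```

With these, a 3-page document yields indices 0, 1 and 2, and a 1-page document yields just index 0.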
Upvotes: 2
Reputation: 140276
When doing this:
pdf = PyPDF2.PdfFileReader(file(path, "rb"))
you're passing a reference to a file handle, but you have no control over when the file will be closed.
You should create a context with the handle instead of passing it anonymously.
I would write:
with open(path, "rb") as f:
    pdf = PyPDF2.PdfFileReader(f)
    # Check for number of pages, prevents out of bounds errors
    # ... do your processing
# Collapse whitespace
content = " ".join(content.replace(u"\xa0", " ").strip().split())
# now the file is closed by exiting the block, you can delete it
os.remove(path)
# and return the contents
return content
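The same close-then-delete pattern can be wrapped in a small helper. This is only a sketch: parse_then_delete is a hypothetical name, and parse is a hypothetical callable standing in for the PyPDF2 work, which must finish reading before it returns:

```python
import os


def parse_then_delete(path, parse):
    """Run parse(f) on the open file, close it, then delete it from disk."""
    # `parse` is a placeholder for whatever extracts the content;
    # it receives the open binary file object.
    with open(path, "rb") as f:
        result = parse(f)
    # The with block has exited, so the handle is closed and the
    # file can now be removed.
    os.remove(path)
    return result
```

If parse raises, the with block still closes the file, but os.remove is never reached, so a file that failed to parse stays on disk for inspection.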
Upvotes: 2