Jenny_V

Reputation: 73

Converting a PDF file to a Text file in Python

I've spent several days researching how to extract specific information from a PDF file.

Eventually I was able to fetch all the information with Python from a text file, which I created manually by opening the PDF and choosing File > Save as Text.

My question is: how do I get Python to do those steps itself? That is, open the PDF file (opening it is easy enough with open("file path")), click File in the menu, and save the file as a text file in the same directory.

Just to be clear, I don't need the pdfminer or pypdf libraries, as I have already extracted the information from the same file (after converting it manually to txt).

Upvotes: 3

Views: 5014

Answers (2)

piyush tiwari

Reputation: 31

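You can use the "tabula" Python library. It is a wrapper around a Java tool, so you need Java installed. Note that the package on PyPI is called tabula-py, so run "pip install tabula-py", import tabula in your script, and then convert the PDF like this:

tabula.convert_into("path_or_name_of_pdf.pdf", "output.txt", output_format="csv", pages='all')

Keep in mind that tabula is aimed at extracting tables, and with output_format="csv" the output file contains CSV data even though it is named output.txt. You can find the other functions in the tabula-py documentation. It worked for me. Cheers!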
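As a rough sketch of how that call might sit in a script, assuming tabula-py and Java are both installed: the helper names csv_path_for and pdf_tables_to_csv are made up for illustration, and only tabula.convert_into comes from the library.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Hypothetical helper: map report.pdf to report.csv in the same folder."""
    return str(Path(pdf_path).with_suffix(".csv"))


def pdf_tables_to_csv(pdf_path):
    """Write the tables tabula finds in pdf_path to a CSV file next to it."""
    # Imported here so the script still loads where tabula-py is missing.
    import tabula  # provided by the tabula-py package; needs Java installed

    out = csv_path_for(pdf_path)
    tabula.convert_into(pdf_path, out, output_format="csv", pages="all")
    return out
```

Calling pdf_tables_to_csv("path_or_name_of_pdf.pdf") would then produce a .csv file beside the PDF. If the PDF is mostly running text rather than tables, this is probably not the right tool.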

Upvotes: 0

pawelty

Reputation: 1000

You could use pdftotext.exe, which you can download from http://www.foolabs.com/xpdf/download.html, and then run it on your PDF files from Python:

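import glob
import os
import subprocess

# Remember to put pdftotext.exe in the folder with your PDF files
# and to run this script from that folder.
folder = os.getcwd()
for filename in glob.glob(os.path.join(folder, '*.pdf')):
    # Write the text next to the PDF, e.g. report.pdf -> report.txt
    output = os.path.splitext(filename)[0] + '.txt'
    subprocess.call([os.path.join(folder, 'pdftotext'), filename, output])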

At least it worked for one of my projects.
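If you want something a bit more reusable, here is a minimal sketch of the same idea wrapped in functions. It assumes pdftotext can be found on your PATH rather than sitting next to the PDFs. The names build_command and convert_folder are just illustrative, and -layout is an optional pdftotext flag that tries to keep the page layout.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, exe="pdftotext", layout=False):
    """Return the argument list for one pdftotext run."""
    pdf_path = Path(pdf_path)
    cmd = [exe]
    if layout:
        # -layout asks pdftotext to keep the physical layout of the page
        cmd.append("-layout")
    # Output goes next to the input, with the extension swapped to .txt
    cmd += [str(pdf_path), str(pdf_path.with_suffix(".txt"))]
    return cmd


def convert_folder(folder=".", layout=False):
    """Convert every PDF in folder and return the PDFs that failed."""
    exe = shutil.which("pdftotext")  # assumption: the tool is on PATH
    if exe is None:
        raise FileNotFoundError("pdftotext was not found on PATH")
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = subprocess.run(build_command(pdf, exe, layout))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Calling convert_folder() from the directory with your PDFs should leave a .txt file beside each one, and the returned list tells you which files pdftotext could not handle.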

Upvotes: 1
