Reading many pdf files in python

Question

I have a bunch of PDF files with text in one directory. I'd like to read them all at once and save the text in a dictionary. Right now I can only do it one file at a time using the textract library, like this:

import textract

text = textract.process('/Users/user/Documents/Data/CLAR.pdf', 
                        method='tesseract', 
                        language='eng')

How can I read them all at once? Do I need a for loop over the directory, or is there some other way?

An economist · Accepted Answer

One solution is to use the os library with a for loop:

import os
import textract

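# Absolute paths of the .pdf files in the current working directory.
# Checking the file extension (rather than '.pdf' anywhere in the path)
# avoids matching names like 'notes.pdf.txt' or a parent folder called 'my.pdfs'.
files_path = [os.path.abspath(x) for x in os.listdir()
              if x.lower().endswith('.pdf')]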

pdfs = []
for file in files_path:
    text = textract.process(file,
                            method='tesseract',
                            language='eng')

    pdfs.append(text)

The code does three things:

1. Gets all files in the current directory
2. Excludes files that are not PDFs
3. Saves the extracted text in a list (a dictionary or another data structure would work too)
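
Since the question asks for a dictionary, here is a minimal sketch of the same idea that maps each file name to its text. The names read_pdfs and extract_with_textract are made up for illustration and are not part of textract; the textract.process call is the one from the question. The extractor is passed in as a parameter so a different library can be swapped in.

```python
from pathlib import Path


def extract_with_textract(pdf_path):
    # Hypothetical wrapper around the call used in the question.
    # textract is imported here so read_pdfs can run with another extractor.
    import textract
    return textract.process(str(pdf_path), method='tesseract', language='eng')


def read_pdfs(directory, extract=extract_with_textract):
    """Return {file name: extracted text} for PDFs directly inside `directory`."""
    texts = {}
    for pdf_path in sorted(Path(directory).iterdir()):
        # Keep regular files whose extension is .pdf (case-insensitive);
        # subdirectories are not searched.
        if pdf_path.is_file() and pdf_path.suffix.lower() == '.pdf':
            texts[pdf_path.name] = extract(pdf_path)
    return texts
```

With the directory from the question this would be called as pdfs = read_pdfs('/Users/user/Documents/Data'), and the text of one file looked up as pdfs['CLAR.pdf']. As far as I know textract.process returns bytes, so call .decode('utf-8') on a value if you need a str.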
