Keithx

Reputation: 3148

Reading many pdf files in python

I have a bunch of PDF files containing text in one directory. I would like to read them all at once and store the results in a dictionary. So far I can only process them one at a time with the textract library, like this:

import textract

text = textract.process('/Users/user/Documents/Data/CLAR.pdf', 
                        method='tesseract', 
                        language='eng')

How can I read them all at once? Do I need a for loop over the directory, or is there some other way?

Upvotes: 2

Views: 1174

Answers (1)

An economist

Reputation: 1311

One solution is to use the os library with a for loop:

import os
import textract

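# Absolute paths of everything in the current working directory
files_path = [os.path.abspath(x) for x in os.listdir()]

# Keep only names ending in .pdf (case-insensitive); "'.pdf' in pdf" would also match e.g. notes.pdf.txt
files_path = [pdf for pdf in files_path if pdf.lower().endswith('.pdf')]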

pdfs = []
for file in files_path:
    text = textract.process(file,
                            method='tesseract',
                            language='eng')

    pdfs.append(text)  # textract.process returns bytes

The code above works in three steps:

  1. Get all files in the current directory
  2. Exclude files that are not .pdf
  3. Save the text into a list (a different data structure, such as a dictionary, would also work)
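
Since the question asks for a dictionary, the same three steps could be sketched roughly like this. This is only a minimal sketch: `read_pdfs` and `textract_extract` are hypothetical helper names (not part of textract), the `method='tesseract'` and `language='eng'` arguments are copied from the question, and decoding as UTF-8 assumes textract's default output encoding.

```python
from pathlib import Path


def textract_extract(path):
    """Extract text with textract, using the same arguments as the question."""
    import textract  # imported lazily so read_pdfs can be used with another extractor

    raw = textract.process(str(path), method='tesseract', language='eng')
    return raw.decode('utf-8')  # assumption: output is UTF-8 encoded bytes


def read_pdfs(directory, extract=textract_extract):
    """Return {file name: extracted text} for every PDF directly inside `directory`."""
    texts = {}
    for path in sorted(Path(directory).iterdir()):
        # Skip sub-directories and anything whose extension is not .pdf
        if path.is_file() and path.suffix.lower() == '.pdf':
            texts[path.name] = extract(path)
    return texts
```

With the directory from the question, `texts = read_pdfs('/Users/user/Documents/Data')` should then let you look up a single document as `texts['CLAR.pdf']`. Passing the extractor as a parameter is just a convenience so a different extraction function can be swapped in.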

Upvotes: 3
