peppe

Reputation: 11

Convert multiple PDFs in a folder to TXT with Python

I tried the following code, but it converts only the last PDF in the folder:

import fitz  # this is pymupdf
import glob, os
os.chdir('C:/Users/XXXXXXX')
pdfs = []
for file in glob.glob("*.pdf"):
 with fitz.open(file) as doc:
    text = ""
    for page in doc:
        text += page.getText()
textfile = open('textfile.txt', 'w',encoding="utf-8")
textfile.write(text)

Can you help me?

I am using Python 3.8.

Upvotes: 0

Views: 1661

Answers (3)

rzaratx

Reputation: 824

Pass getText the format you want ("text" for plain text, which is also the default). Append each page's text to a list so that one page does not overwrite another, then convert that list to a string.

Edit: I've modified my original answer to do what you asked. To write each PDF to its own .txt file, the file writing has to happen inside the loop. Close textfile before moving on to the next PDF so that its contents are flushed to disk.

import fitz  # PyMuPDF
import glob, os

DIR = '\\pdftext\\'
os.chdir(DIR + 'pdf\\')

def listToString(s):
    # Join the per-page strings into a single string.
    return "".join(s)

for file in glob.glob("*.pdf"):
    print(file)
    filename = os.path.splitext(file)[0]  # file name without the .pdf extension
    pdfs = []  # reset for every PDF: one entry per page

    with fitz.open(file) as doc:
        for page in doc:
            # "text" asks for plain text; newer PyMuPDF releases name this method get_text
            pdfs.append(page.getText("text"))

    # Write inside the file loop so every PDF gets its own .txt file.
    textfile = open(DIR + 'text\\' + filename + '.txt', 'w', encoding="utf-8")
    textfile.write(listToString(pdfs))
    textfile.close()
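
The same idea can also be packaged as a function, which makes it easier to check that every PDF really gets its own output file. This is only a rough sketch: convert_folder and extract_with_pymupdf are names made up for illustration, it assumes a recent PyMuPDF release where the page method is called get_text, and the fitz import is deferred so that the folder logic can be tried with a stand-in extractor.

```python
from pathlib import Path


def extract_with_pymupdf(pdf_path):
    """Return the plain text of every page of one PDF (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so the folder logic runs without it

    with fitz.open(str(pdf_path)) as doc:
        # get_text is the current method name; older releases call it getText.
        return "".join(page.get_text("text") for page in doc)


def convert_folder(src_dir, dst_dir, extract=extract_with_pymupdf):
    """Write one .txt file per PDF found in src_dir and return the paths written."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        target = dst / (pdf.stem + ".txt")
        # The write happens inside the loop, once per PDF.
        target.write_text(extract(pdf), encoding="utf-8")
        written.append(target)
    return written
```

Calling convert_folder(DIR + 'pdf\\', DIR + 'text\\') should then do the same job as the loop above.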

Upvotes: 1

peppe

Reputation: 11

I tried:

import fitz  # PyMuPDF
import glob
for fname in glob.glob("C:/Users/XXXXXX/*.pdf"):
    doc = fitz.open(fname)  # open document
    out = open(fname + ".txt", "wb")  # open text output (binary mode)
    for page in doc:  # iterate the document pages
        text = page.getText().encode("utf8")  # get plain text, encoded as UTF-8
        out.write(text)  # write text of page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
    out.close()  # close this output before moving to the next PDF

It works, but I still have to test the result :-)
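
Since every page is followed by a form feed (0x0C), the output can be split back into pages later on. A minimal sketch, assuming the file was written exactly as above and that the page text itself contains no form feeds; split_pages is just a name picked for this example.

```python
def split_pages(data: bytes) -> list:
    """Split UTF-8 text that has a form feed written after every page."""
    pages = data.decode("utf8").split("\f")
    # The delimiter also follows the last page, so drop the empty tail.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

The bytes would come from reading one of the output files in binary mode, for example open(fname + ".txt", "rb").read().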

Upvotes: 0

BerkZerker707

Reputation: 17

If the problem is that your loop isn't working (and it probably is), you can use os.walk("start_dir") instead, which also descends into subfolders. For example:

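import os
import fitz  # PyMuPDF

for path, dirs, files in os.walk('.'):  # Walk every folder below the current one.
    for file in files:  # Loop through each file.
        if file.lower().endswith('.pdf'):  # Skip anything that is not a PDF.
            with fitz.open(os.path.join(path, file)) as doc:  # Open by full path.
                ...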
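
One thing to watch with os.walk is that it yields bare file names, so each name has to be joined with its folder before it can be opened. A small sketch of collecting the PDF paths first; find_pdfs is a made-up helper name, and matching on the .pdf extension regardless of case is an assumption about what you want.

```python
import os


def find_pdfs(start_dir):
    """Return the paths of all .pdf files below start_dir, sorted."""
    found = []
    for path, dirs, files in os.walk(start_dir):
        for name in files:
            if name.lower().endswith(".pdf"):
                # os.walk gives names only, so prepend the containing folder.
                found.append(os.path.join(path, name))
    return sorted(found)
```

Each returned path can then be passed to fitz.open in a loop.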

Upvotes: 1
