Reputation: 11
I tried the following code, but it converts only the last PDF in the folder:
import fitz  # this is pymupdf
import glob, os

os.chdir('C:/Users/XXXXXXX')
pdfs = []
for file in glob.glob("*.pdf"):
    with fitz.open(file) as doc:
        text = ""
        for page in doc:
            text += page.getText()
textfile = open('textfile.txt', 'w',encoding="utf-8")
textfile.write(text)
Can you help me? I am using Python 3.8.
Upvotes: 0
Views: 1661
Reputation: 824
Pass getText the kind of output you want, for example "text" for plain text. Then append that text to a list created outside of the page loop so it is not overwritten. Finally, convert that list to a string.
Edit: I've modified my original answer to do what you asked. To write each PDF to its own .txt file, you need to move the file writing into the loop. Don't forget to close textfile before moving on to the next PDF; otherwise the output may not be written completely.
import fitz  # PyMuPDF
import glob, os

DIR = '\\pdftext\\'
os.chdir(DIR + 'pdf\\')

def listToString(s):
    # Concatenate a list of strings into one string.
    str1 = ""
    for ele in s:
        str1 += ele
    return str1

for file in glob.glob("*.pdf"):
    print(file)
    filename = os.path.splitext(file)
    filename = filename[0]  # file name without the .pdf extension
    pdfs = []
    with fitz.open(file) as doc:
        text = ""
        for page in doc:
            text += page.getText("text")  # request plain text
        pdfs.append(text)  # append once, after all pages are read
    # Write this PDF's text to its own .txt file, then close it.
    textfile = open(DIR + 'text\\' + filename + '.txt', 'w',encoding="utf-8")
    pages = listToString(pdfs)
    textfile.write(pages)
    textfile.close()
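For comparison, here is a minimal sketch of the same idea (one .txt per PDF) using pathlib, so the output file is opened and closed in a single call. The names pdf_to_text and convert_folder are made up for this illustration, and the extract parameter is only there so the folder loop can be tried without PyMuPDF installed. The sketch assumes that newer PyMuPDF releases call the method get_text while older ones call it getText, so it tries both.

```python
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the plain text of every page in one PDF (requires PyMuPDF)."""
    import fitz  # PyMuPDF; imported here so the folder loop works without it

    parts = []
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            # Assumption: newer releases expose get_text, older ones getText.
            get_text = getattr(page, "get_text", None) or page.getText
            parts.append(get_text("text"))
    return "".join(parts)


def convert_folder(src_dir, out_dir, extract=pdf_to_text):
    """Write one .txt file per PDF in src_dir; return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        target = out / (pdf.stem + ".txt")
        # write_text opens, writes, and closes the file in one call.
        target.write_text(extract(pdf), encoding="utf-8")
        written.append(target)
    return written
```

Calling convert_folder('pdf', 'text') would then play the role of the loop above; because each output file is written inside the loop, no PDF overwrites another.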
Upvotes: 1
Reputation: 11
I tried:
import sys, fitz
import glob

for fname in glob.glob("C:/Users/XXXXXX/*.pdf"):
    doc = fitz.open(fname)  # open document
    out = open(fname + ".txt", "wb")  # open text output
    for page in doc:  # iterate the document pages
        text = page.getText().encode("utf8")  # get plain text (is in UTF-8)
        out.write(text)  # write text of page
        out.write(bytes((12,)))  # write page delimiter (form feed 0x0C)
    out.close()
It works, but I still have to test the result :-)
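One way to check the result is to read a .txt file back and split it on the form feed character (0x0C) that the code above writes after each page. The following is only a sketch: join_pages and split_pages are hypothetical helper names, and it assumes the page text itself never contains a form feed.

```python
FORM_FEED = "\x0c"  # same character as bytes((12,)) decoded as UTF-8


def join_pages(pages):
    """Concatenate page texts, writing a form feed after each page."""
    return "".join(page + FORM_FEED for page in pages)


def split_pages(text):
    """Split text written by join_pages back into a list of page texts."""
    pages = text.split(FORM_FEED)
    # The delimiter after the last page leaves one empty trailing item.
    if pages and pages[-1] == "":
        pages.pop()
    return pages
```

If len(split_pages(text)) matches the page count of the PDF, every page was written. Note also that fname + ".txt" produces names such as report.pdf.txt, since the .pdf extension is kept.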
Upvotes: 0
Reputation: 17
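If the problem is that your loop isn't working (and it probably is), you can use os.walk("start_dir") instead. Note that os.walk also descends into subfolders and yields bare file names, so join each name with its folder path before opening it. For example: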
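import os
import fitz

for path, dirs, files in os.walk('.'):  # Walk the tree from the current folder.
    for file in files:  # Loop through each file.
        if file.lower().endswith('.pdf'):  # Skip anything that is not a PDF.
            with fitz.open(os.path.join(path, file)) as doc:  # Open by full path.
                ...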
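As a rough sketch of how the os.walk approach could be packaged, the generator below yields the full path of every PDF under a starting folder. The name iter_pdf_paths is invented for this example, and matching on a case-insensitive .pdf suffix is an assumption about how the files are named.

```python
import os


def iter_pdf_paths(start_dir):
    """Yield the full path of every .pdf file under start_dir, recursively."""
    for path, dirs, files in os.walk(start_dir):
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                # os.walk gives bare names, so join them with the folder path.
                yield os.path.join(path, name)
```

Each yielded path can be passed straight to fitz.open, regardless of the current working directory.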
Upvotes: 1