Reputation: 69
I have several articles in a single PDF file, and I am trying to split them apart and write each one to its own .docx file. I managed to separate them with a regex, but when I try to write them to .docx files, I get this error: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters.
My code is as follows:
my_path = "/path/to/pdf"
newpath = textract.process(my_path)
newpath2 = newpath.decode("UTF-8")
result = re.findall(r'\d+ words(.*?)Document \w+', newpath2, re.DOTALL)
save_path = "/path/to/write/docx/files/"
for each in result:
    import time
    time=str(time.time())
    finalpath = (os.path.join(save_path, time))
    finalpath2 = finalpath+".docx"
    mydoc = docx.Document()
    mydoc.add_paragraph(each)
    mydoc.save(finalpath2)
Upvotes: 2
Views: 613
Reputation: 626794
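Strip the null bytes and control characters from each string before adding it to the document:
.add_paragraph(remove_control_characters(each.replace('\x00','')))
The remove_control_characters function can be borrowed from the thread "Removing control characters from a string in python". The explicit .replace('\x00','') is only a safeguard, since the null character already belongs to Unicode category Cc and is removed by the function as well.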
Code snippet:
import os
import re
import time
import unicodedata

import docx
import textract

def remove_control_characters(s):
    # Drop every character whose Unicode category starts with "C" (control, format, etc.)
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")

my_path = "/path/to/pdf"
newpath = textract.process(my_path)
newpath2 = newpath.decode("UTF-8")
result = re.findall(r'\d+ words(.*?)Document \w+', newpath2, re.DOTALL)
save_path = "/path/to/write/docx/files/"
for each in result:
    # Use the current timestamp as the file name, without shadowing the time module
    timestamp = str(time.time())
    finalpath = os.path.join(save_path, timestamp)
    finalpath2 = finalpath + ".docx"
    mydoc = docx.Document()
    mydoc.add_paragraph(remove_control_characters(each.replace('\x00', '')))
    mydoc.save(finalpath2)
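One thing to be aware of: filtering on category "C" also removes tabs and newlines, because they are control characters (category Cc) too. If you want to keep line breaks in the extracted articles, a narrower filter may be preferable. The following is only a sketch, not part of the original answer: it assumes that removing everything outside the character range allowed by XML 1.0 is enough to satisfy the check, and the name strip_xml_illegal is made up for this example.

```python
import re

# Hypothetical helper (not from the original answer).
# Matches any character outside the XML 1.0 "Char" range:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_xml_illegal(text):
    # Remove only the characters XML 1.0 forbids; tabs and newlines survive
    return _XML_ILLEGAL.sub("", text)
```

In the loop above you would then call mydoc.add_paragraph(strip_xml_illegal(each)) instead of remove_control_characters(...).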
Upvotes: 2