Reputation: 953
I want to convert doc/docx files to text files. My requirement is that tables should be preserved as they are.
I have tried python tika, but it puts every cell on its own line, so the rows come out as columns.
For example, take this table in the input doc/docx file:
The table above is converted to text like this:
LANGUAGE
UNDERSTAND
LEARN
HINDI
YES
NO
MARATHI
YES
NO
ENGLISH
YES
NO
The desired output preserves the table format:
LANGUAGE UNDERSTAND LEARN
HINDI YES NO
MARATHI YES NO
ENGLISH YES NO
Please let me know if it is possible.
Upvotes: 3
Views: 997
Reputation: 22443
As @ilmiacs suggested, pandoc can do this for you. Using python, you need to install pypandoc.
Test document:
import pypandoc
print(pypandoc.convert_file("Untitled 1.docx", "plain+simple_tables", format="docx", extra_args=(), encoding='utf-8', outputfile=None))
gives you:
You also have the option of using subprocess to run pandoc directly on the command line.
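The subprocess route could look roughly like the sketch below. It assumes the pandoc executable is installed and on your PATH; build_pandoc_command and docx_to_text are hypothetical helper names, not part of pandoc or pypandoc. The output format string is the same one passed to pypandoc above.

```python
import subprocess


def build_pandoc_command(docx_path, to_format="plain+simple_tables"):
    """Build the pandoc argument list (hypothetical helper for illustration)."""
    # Passing a list avoids shell quoting problems with names like "Untitled 1.docx".
    return ["pandoc", "--from", "docx", "--to", to_format, docx_path]


def docx_to_text(docx_path):
    """Run pandoc and return the converted text.

    Assumes the pandoc executable is installed and on PATH.
    """
    result = subprocess.run(
        build_pandoc_command(docx_path),
        capture_output=True,  # collect stdout and stderr instead of printing them
        encoding="utf-8",     # decode pandoc's output as UTF-8 text
        check=True,           # raise CalledProcessError if pandoc exits non-zero
    )
    return result.stdout


# Usage (needs pandoc and the test document on disk):
# print(docx_to_text("Untitled 1.docx"))
```

Note that pandoc reads docx but not the legacy binary doc format, so .doc files would first need converting to .docx with another tool.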
Upvotes: 6