Karmveer Singh

Reputation: 953

Convert .doc/.docx to text while preserving tables

I want to convert doc/docx files to text files. My requirement is that tables should be preserved as they are.

I have tried Python tika, but it puts each cell on its own line, so the rows end up as columns.

For example, a table in the input doc/docx file with the columns LANGUAGE, UNDERSTAND, and LEARN is converted to text like this:

LANGUAGE
UNDERSTAND
LEARN

HINDI
YES
NO

MARATHI
YES
NO

ENGLISH
YES
NO

The desired output preserves the table format, like this:

 LANGUAGE    UNDERSTAND    LEARN
 HINDI       YES           NO
 MARATHI     YES           NO
 ENGLISH     YES           NO

Please let me know if it is possible.

Upvotes: 3

Views: 997

Answers (1)

Rolf of Saxony

Reputation: 22443

As @ilmiacs suggested, pandoc can do this for you.
From Python, you need to install pypandoc, which is a wrapper around pandoc, so pandoc itself must be installed as well.
Test document:

[screenshot: test .docx document]

import pypandoc
# Convert the .docx to plain text, rendering tables as pandoc "simple tables".
print(pypandoc.convert_file("Untitled 1.docx", "plain+simple_tables", format="docx"))

gives you:

[screenshot: converted plain-text output]

You also have the option of using subprocess to run the pandoc command-line tool directly.
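The subprocess route could look roughly like the sketch below. It assumes the pandoc executable is on your PATH and reuses the same "plain+simple_tables" output format as the pypandoc call. The function names build_pandoc_command and docx_to_text are hypothetical helpers made up for this illustration, not part of pandoc or pypandoc.

```python
import shutil
import subprocess


def build_pandoc_command(docx_path):
    """Return the argument list for converting a .docx file to plain text."""
    # Same conversion as the pypandoc call: docx in, plain text with simple tables out.
    return ["pandoc", "--from", "docx", "--to", "plain+simple_tables", docx_path]


def docx_to_text(docx_path):
    """Run pandoc on docx_path and return the converted text from its stdout."""
    # Assumption: pandoc is installed separately and discoverable on PATH.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        build_pandoc_command(docx_path),
        capture_output=True,   # collect stdout/stderr instead of printing them
        encoding="utf-8",      # decode pandoc's output as UTF-8 text
        check=True,            # raise CalledProcessError if pandoc fails
    )
    return result.stdout
```

Calling print(docx_to_text("Untitled 1.docx")) should then print the same kind of text as the pypandoc version. Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.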

Upvotes: 6
