JaySabir

Reputation: 322

How to convert multiple .doc files to .docx using antiword?

This manual command works:

!antiword "test" > "test.docx"

But the following script converts the files to empty .docx files:

for file in os.listdir(directory):
    subprocess.run(["bash", "-c", "antiword \"$1\" > \"$1\".docx", "_", file])

It also stores the .docx files in the parent directory: e.g., if a file is in \a\b, this command stores the output in \a.

I have tried many different approaches, including running the command directly in the terminal and using bash loops. Only the manual way works.

Upvotes: 1

Views: 1020

Answers (2)

DrIDK

Reputation: 7944

Use Apache Tika + GNU parallel + pandoc (antiword doesn't work well for all kinds of .doc files):

parallel "java -jar tika-app-3.0.0.jar -T {}|pandoc --to docx > {.}.docx" ::: *.doc
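
If you would rather drive the same pipeline from Python, here is a rough sketch of what `tika | pandoc > file.docx` could look like with `subprocess` and no shell. The helper names `pipe_to_file` and `doc_to_docx` are made up for illustration; the jar name and flags are simply copied from the one-liner above, so adjust them to your setup.

```python
import subprocess
from pathlib import Path


def pipe_to_file(first_cmd, second_cmd, dest):
    """Rough equivalent of `first_cmd | second_cmd > dest`, without a shell."""
    with open(dest, "wb") as out:
        first = subprocess.Popen(first_cmd, stdout=subprocess.PIPE)
        second = subprocess.Popen(second_cmd, stdin=first.stdout, stdout=out)
        # Close our copy of the pipe so `first` sees a broken pipe
        # if `second` exits early.
        first.stdout.close()
        second.wait()
        first.wait()
    return first.returncode == 0 and second.returncode == 0


def doc_to_docx(src, tika_jar="tika-app-3.0.0.jar"):
    """Hypothetical wrapper: extract text with Tika, let pandoc write .docx."""
    src = Path(src)
    return pipe_to_file(
        ["java", "-jar", tika_jar, "-T", str(src)],
        ["pandoc", "--to", "docx"],
        src.with_suffix(".docx"),
    )
```

Calling `doc_to_docx(path)` for each .doc file would then write the .docx next to the source file. This assumes `java` and `pandoc` are on your PATH and the jar is in the current directory.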

https://tika.apache.org/
https://pandoc.org/

Upvotes: 1

AKX

Reputation: 168814

Something like this should work (adjust dest_path etc. accordingly).

import os
import shlex

for filename in os.listdir(directory):
    # Skip anything that is not a legacy .doc file (this also skips .docx).
    if not filename.lower().endswith(".doc"):
        continue
    path = os.path.join(directory, filename)
    dest_path = os.path.splitext(path)[0] + ".txt"
    cmd = "antiword %s > %s" % (shlex.quote(path), shlex.quote(dest_path))
    print(cmd)
    # If the above seems to print correct commands, add:
    # os.system(cmd)
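
As a possible alternative to building a shell command string, here is a minimal sketch that skips the shell entirely and lets `subprocess.run` write antiword's output straight to a file. The function name `convert_doc_files` and its `converter` and `suffix` parameters are my own additions for illustration. Note that `os.listdir` returns bare file names, so each one has to be joined with the directory (here via `pathlib`); otherwise antiword looks in the current working directory, which would explain both the empty output and the files landing in the wrong place.

```python
import subprocess
from pathlib import Path


def convert_doc_files(directory, converter=("antiword",), suffix=".txt"):
    """Run `converter <file>` on every .doc file and save stdout next to it."""
    written = []
    for src in sorted(Path(directory).iterdir()):
        if src.suffix.lower() != ".doc":
            continue  # skips .docx and everything else
        dest = src.with_suffix(suffix)
        # Pass the full path as a list argument: no quoting needed.
        with open(dest, "wb") as out:
            result = subprocess.run([*converter, str(src)], stdout=out)
        if result.returncode != 0:
            print(f"conversion failed for {src} (exit code {result.returncode})")
            dest.unlink()  # do not leave an empty output file behind
            continue
        written.append(dest)
    return written
```

Usage would be something like `convert_doc_files("/a/b")`. The default suffix stays `.txt` as in the code above, since antiword emits plain text and redirecting that into a file named .docx does not make it a real Word document.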

Upvotes: 2
