mohjak

Reputation: 141

Antiword can't open 'C:\\?????? ????????\\info.doc' for reading in Windows

Description

I am using the textract Python library to extract text from Word documents. The problem: if the path contains Arabic characters, antiword reports that it can't open the document for reading.

Example

import textract

# path = 'C:\\test-docs\\info.doc'
path = 'C:\\مجلدات اختبارية\\info.doc'
text = textract.process(path, encoding='UTF-8')

print(text)

Error

Traceback (most recent call last):
  File "c:\test-extract-doc.py", line 5, in <module>
    text = textract.process(path, encoding='UTF-8')
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\__init__.py", line 77, in process 
    return parser.process(filename, encoding, **kwargs)
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\utils.py", line 46, in process    
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\doc_parser.py", line 9, in extract
    stdout, stderr = self.run(['antiword', filename])
  File "C:\Users\mohja\AppData\Local\Programs\Python\Python39\lib\site-packages\textract\parsers\utils.py", line 100, in run       
    raise exceptions.ShellError(
textract.exceptions.ShellError: The command `antiword C:\مجلدات اختبارية\info.doc` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b"I can't find the name of your HOME directory\r\nI can't open 'C:\\?????? ????????\\info.doc' for reading\r\n"


Upvotes: 1

Views: 1709

Answers (1)

Tomalak

Reputation: 338108

After digging into the source code of textract, it becomes clear that for extraction from .doc the (ancient) command line tool antiword is used.

class Parser(ShellParser):
    """Extract text from doc files using antiword.
    """

    def extract(self, filename, **kwargs):
        stdout, stderr = self.run(['antiword', filename])
        return stdout

Python does everything properly, but apparently antiword itself has issues with the way it parses its arguments, at least on Windows, so passing a Unicode path results in breakage.
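
The question marks in antiword's error message are consistent with the path having been squeezed into a legacy single-byte code page somewhere on the antiword side. A minimal sketch, assuming code page 1252 (the machine's actual ANSI code page is not stated in the question) and using a hypothetical helper `to_ansi` purely for illustration, reproduces the same pattern in Python:

```python
# Illustration only: to_ansi is a hypothetical helper, and cp1252 is an
# assumed code page; antiword's real conversion happens inside its own code.
def to_ansi(path, codepage="cp1252"):
    """Round-trip a path through a single-byte code page, replacing
    every character the code page cannot represent with '?'."""
    return path.encode(codepage, errors="replace").decode(codepage)

path = 'C:\\مجلدات اختبارية\\info.doc'
mangled = to_ansi(path)
print(mangled)  # prints C:\?????? ????????\info.doc
```

The drive letter, backslashes, space and file name survive because they are plain ASCII; only the Arabic letters are lost, which matches the path shown in the stderr output.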

Luckily Windows offers a way of converting any path into a backwards-compatible form of ANSI-only 8.3 filenames - the so-called "short" paths, which can be requested from the system with a Win32 API call. Short paths and regular ("long") paths are interchangeable, but legacy software might like short paths better.

This provides a work-around: retrieve the short path for any .doc file and give that to antiword instead. Win32 API calls are supplied in Python by the win32api module (part of the pywin32 package):

import textract
from win32api import GetShortPathName

def extract_text(path):
    # antiword (used for .doc) chokes on non-ANSI paths,
    # so hand it the 8.3 short path instead
    if path.lower().endswith(".doc"):
        path = GetShortPathName(path)

    return textract.process(path, encoding='UTF-8')

text = extract_text('C:\\مجلدات اختبارية\\info.doc')
print(text)
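
If installing pywin32 is not an option, the same Win32 call can probably be reached through the standard-library ctypes module. The following is a sketch under assumptions, not tested against textract: `get_short_path` is a name made up for this example, the file must already exist for the call to succeed, and on volumes where 8.3 name generation is disabled Windows may hand back the path unchanged, in which case the work-around would not help.

```python
import ctypes
import sys

def get_short_path(long_path):
    """Return the 8.3 short form of long_path on Windows.

    On any other platform the path is returned unchanged, since short
    paths are a Windows-only concept. (Hypothetical helper name.)
    """
    if sys.platform != "win32":
        return long_path

    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    fn = kernel32.GetShortPathNameW
    fn.argtypes = [wintypes.LPCWSTR, wintypes.LPWSTR, wintypes.DWORD]
    fn.restype = wintypes.DWORD

    size = 260  # initial guess; grown below if the API asks for more
    while True:
        buf = ctypes.create_unicode_buffer(size)
        needed = fn(long_path, buf, size)
        if needed == 0:
            # The call failed, e.g. because the file does not exist
            raise ctypes.WinError(ctypes.get_last_error())
        if needed < size:
            # Success: 'needed' is the length written, excluding the null
            return buf.value
        # Buffer too small: 'needed' is the required size including the null
        size = needed
```

The result can then be passed to textract.process in place of GetShortPathName(path) from the answer's code.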

Upvotes: 2
