Reputation: 23078
I am developing a full-text search engine for indexing popular binary formats. I know that there are hundreds of such questions (and solutions) already, but I found it tough to find one:
Upvotes: 0
Views: 2686
Reputation: 2109
Textract wraps the usual extraction tool for each kind of file behind a single interface.
https://github.com/deanmalmgren/textract
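A minimal sketch of how this could be used, assuming textract is installed along with whatever external tools it needs for your file types. `textract.process` is the library's entry point and returns bytes; the `extract_text` wrapper and its `process` parameter are hypothetical names added here so a different extractor can be swapped in.

```python
from typing import Callable, Optional


def extract_text(path: str, process: Optional[Callable[[str], bytes]] = None) -> str:
    """Return the text content of the file at `path` as a str."""
    if process is None:
        # Imported lazily so this module still loads where textract is absent.
        import textract
        process = textract.process
    raw = process(path)  # textract.process returns bytes
    # Assumption: the extractor emits UTF-8; undecodable bytes are replaced.
    return raw.decode("utf-8", errors="replace")
```

Calling `extract_text("report.doc")` (a made-up file name) would then hand back plain text ready to feed into an indexer.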
Upvotes: 0
Reputation: 20354
One possible solution is to use Google Docs to extract the text contents from binary .doc files: upload the document to Google Docs, then download the text contents. It is a fairly slow process, but it is the only "pure Python" solution I know of, since it requires nothing external except network access. An external tool such as catdoc or antiword is a much better solution if you are allowed to install it on your host.
Upvotes: 0
Reputation: 100756
Tools for .doc files: antiword and catdoc (and probably others). If the packages are installed on your shared host, you could use subprocess to shell out to these tools. Available on Windows via Cygwin.
Upvotes: 1
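Shelling out to antiword or catdoc with subprocess might look roughly like the sketch below. The function names are hypothetical, and it assumes the tool writes the extracted text to stdout when given a file path and that the output can be decoded as UTF-8 (the actual encoding depends on the tool's settings and your locale).

```python
import shutil
import subprocess
from typing import Optional, Sequence


def find_doc_tool(candidates: Sequence[str] = ("antiword", "catdoc")) -> Optional[str]:
    """Return the first candidate found on PATH, or None if none is installed."""
    for name in candidates:
        if shutil.which(name):
            return name
    return None


def doc_to_text(path: str, tool: Optional[str] = None) -> str:
    """Run the extraction tool on `path` and return whatever it prints."""
    tool = tool or find_doc_tool()
    if tool is None:
        raise RuntimeError("neither antiword nor catdoc was found on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    result = subprocess.run([tool, path], capture_output=True, check=True)
    # Assumption: output is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Checking for the tool first lets you fall back to another approach on hosts where neither package is installed.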
Reputation: 54292
If you can use OpenOffice on the server side, then you can use unoconv, which converts between any document formats supported by OpenOffice.
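As a rough sketch, unoconv can be driven from Python the same way. This assumes unoconv is on PATH and can start OpenOffice; the `-f` (output format) and `--stdout` options belong to unoconv, while the helper names below are hypothetical and the UTF-8 decoding is an assumption.

```python
import subprocess
from typing import List


def unoconv_command(path: str, fmt: str = "txt") -> List[str]:
    """Build the unoconv command line that prints the converted file to stdout."""
    return ["unoconv", "-f", fmt, "--stdout", path]


def convert_to_text(path: str) -> str:
    """Convert the document at `path` to plain text via unoconv."""
    result = subprocess.run(unoconv_command(path), capture_output=True, check=True)
    # Assumption: the converted text is UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Keeping the command construction separate makes it easy to inspect or log exactly what will be run before launching OpenOffice.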
Upvotes: 0