Reputation: 1921
I have a website where users upload documents in .doc and .pdf format. I am using Sphinx to run full-text searches on my MySQL database. What is the best way to index these file formats with Sphinx?
Upvotes: 10
Views: 7964
Reputation: 343
Has anyone used Apache Tika to index other types of documents, much as the Solr plugin does?
Upvotes: 1
Reputation: 18166
The method I use for this is pdf2text and antiword. I use both of these to dump the contents of the PDFs and Word documents into the database. From there it's easy to index with Sphinx.
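A rough sketch of that extraction step, in case it helps. It assumes the converters are on the PATH and print to stdout: here the PDF tool is taken to be Poppler/Xpdf's `pdftotext` (invoked as `pdftotext file.pdf -`), which may or may not be the exact tool meant above, and `antiword file.doc` for Word files. The function names are made up for illustration.

```python
import subprocess
from pathlib import Path

# Assumed command lines: each tool writes the extracted text to stdout.
# "pdftotext FILE -" sends output to stdout instead of FILE.txt.
EXTRACTORS = {
    ".pdf": lambda p: ["pdftotext", p, "-"],
    ".doc": lambda p: ["antiword", p],
}


def extractor_command(path):
    """Return the argv list for the converter matching the file extension."""
    suffix = Path(path).suffix.lower()
    try:
        build = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {path}") from None
    return build(str(path))


def extract_text(path):
    """Run the converter and return its output as a string.

    Raises subprocess.CalledProcessError if the tool exits with an error,
    and FileNotFoundError if the tool is not installed.
    """
    result = subprocess.run(
        extractor_command(path), capture_output=True, check=True
    )
    # Decode leniently; the converters' output encoding is not guaranteed.
    return result.stdout.decode("utf-8", errors="replace")
```

The string returned by `extract_text` would then be written to a text column next to the upload's row, and that column listed in the Sphinx source query.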
Upvotes: 9
Reputation: 16226
Unfortunately, Sphinx can't index those file types directly. You'll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
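For the XML route, the format in question is Sphinx's xmlpipe2. Below is a minimal sketch of a generator for it, assuming the text has already been extracted; the field names `title` and `content` and the function name are placeholders, not anything Sphinx requires.

```python
from xml.sax.saxutils import escape


def xmlpipe2_docset(docs):
    """Build an xmlpipe2 stream from (doc_id, title, content) tuples.

    doc_id must be a unique positive integer per document.
    """
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, title, content in docs:
        parts.append(f'<sphinx:document id="{int(doc_id)}">')
        # escape() handles &, < and >, which would otherwise break the XML.
        parts.append(f"<title>{escape(title)}</title>")
        parts.append(f"<content>{escape(content)}</content>")
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(xmlpipe2_docset([(1, "Sample upload", "Extracted text goes here")]))
```

A script like this would be hooked up through a source of type `xmlpipe2` with `xmlpipe_command` pointing at it. Extracted text can contain control characters that are not legal in XML, so real input may need filtering before it is emitted.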
Upvotes: 6