Jared Brown

Reputation: 1921

Indexing Word Documents and PDFs with Sphinx

I have a website where users upload documents in .doc and .pdf format. I am using Sphinx to run full-text searches on my MySQL database. What is the best way to index these file formats with Sphinx?

Upvotes: 10

Views: 7964

Answers (3)

Wadester

Reputation: 343

Has anyone used Apache Tika to index other types of documents, much like the Solr plugin does?

Some links:

  1. pdftotext (often referred to as pdf2text) ships in the poppler or poppler-utils package on Linux
  2. antiword: seems to handle only the older .doc format, not the newer .docx

Upvotes: 1

mlissner

Reputation: 18166

The method I use for this is pdftotext (often called pdf2text) and antiword. I use both of these to dump the text of the PDFs and Word documents into the database. From there it's easy to index with Sphinx.
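
A minimal sketch of that extraction step, assuming the `pdftotext` and `antiword` binaries are on the PATH. The function names `extraction_command` and `extract_text` are made up for illustration and are not from this answer; the `-m UTF-8.txt` mapping also assumes antiword's standard mapping files are installed.

```python
import subprocess
from pathlib import Path


def extraction_command(path):
    """Build the command that prints a document's plain text to stdout.

    Hypothetical helper: picks the tool from the file extension.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # pdftotext (poppler-utils): a trailing "-" writes the text to stdout.
        return ["pdftotext", "-enc", "UTF-8", str(path), "-"]
    if suffix == ".doc":
        # antiword prints to stdout; -m selects the character mapping file.
        return ["antiword", "-m", "UTF-8.txt", str(path)]
    raise ValueError(f"unsupported file type: {suffix!r}")


def extract_text(path):
    """Run the extractor and return the document text as a string."""
    result = subprocess.run(
        extraction_command(path),
        capture_output=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    # Replace undecodable bytes instead of failing on a bad document.
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string would then be written to a text column in the MySQL table that Sphinx indexes, for example when the upload is processed.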

Upvotes: 9

pat

Reputation: 16226

Unfortunately, Sphinx can't index those file types directly. You'll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
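
For the XML route, Sphinx's xmlpipe2 source type reads a stream of `<sphinx:document>` elements. A rough sketch of producing such a stream, assuming the text has already been extracted from the files; the function name `xmlpipe2_stream` and the field names `title` and `content` are placeholders chosen for this example:

```python
import re
from xml.sax.saxutils import escape

# Control characters that are not allowed in XML 1.0 documents.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def _clean(text):
    """Drop characters XML forbids, then escape &, < and >."""
    return escape(_INVALID_XML_CHARS.sub(" ", text))


def xmlpipe2_stream(documents):
    """Render (id, title, body) tuples as an xmlpipe2 document set.

    The ids must be unique positive integers.
    """
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, title, body in documents:
        parts.append(f'<sphinx:document id="{int(doc_id)}">')
        parts.append(f"<title>{_clean(title)}</title>")
        parts.append(f"<content>{_clean(body)}</content>")
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts) + "\n"
```

A script that prints this output can be hooked up through the `xmlpipe_command` setting of a source declared with `type = xmlpipe2` in sphinx.conf.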

Upvotes: 6
