nora

Reputation: 3

Is there a way to get all styles from a doc file with Apache Tika?

I was parsing .doc files with POI, and the need to handle text decorations led me to Apache Tika. I can now extract text with simple decorations such as <i></i>, but I would like to handle more complex styles. My documents contain different font sizes, subscript, superscript, and so on. Is there a way to get all of this information with Tika? If not, can anyone point me to a more suitable tool?

Upvotes: 0

Views: 765

Answers (1)

Tim Allison

Reputation: 635

Tika doesn't handle much more than <i> and <b> at the moment, as you've found. Depending on the complexity of the documents, you might consider using POI directly (perhaps using Tika's parsers as examples). You could also ask on the Tika dev mailing list whether there would be interest in adding other formatting features to Tika, or open a ticket on our Jira site.
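As a rough illustration of the "use POI directly" route, here is a minimal sketch that walks the character runs of a .doc file with POI's HWPF classes and prints the font size and sub/superscript state of each run. It assumes the poi and poi-scratchpad jars are on the classpath. The class name DocStyleDump and the two helper methods are made up for this example. The mapping of the sub/superscript index (1 = superscript, 2 = subscript) and the half-point unit of getFontSize() reflect my reading of the POI Javadoc, so check them against the POI version you use.

```java
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Range;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical example class, not part of POI or Tika.
public class DocStyleDump {

    // Assumed mapping of CharacterRun.getSubSuperScriptIndex():
    // 0 = normal, 1 = superscript, 2 = subscript.
    static String describeScript(short index) {
        switch (index) {
            case 1:
                return "superscript";
            case 2:
                return "subscript";
            default:
                return "normal";
        }
    }

    // HWPF reports font sizes in half-points; convert to points.
    static double halfPointsToPoints(int halfPoints) {
        return halfPoints / 2.0;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a binary .doc file (not .docx).
        try (InputStream in = new FileInputStream(args[0])) {
            HWPFDocument doc = new HWPFDocument(in);
            Range range = doc.getRange();

            // Each character run is a stretch of text sharing one set of formatting.
            for (int i = 0; i < range.numCharacterRuns(); i++) {
                CharacterRun run = range.getCharacterRun(i);
                System.out.printf(
                        "font=%s size=%.1fpt bold=%b italic=%b script=%s text=%s%n",
                        run.getFontName(),
                        halfPointsToPoints(run.getFontSize()),
                        run.isBold(),
                        run.isItalic(),
                        describeScript(run.getSubSuperScriptIndex()),
                        run.text());
            }
        }
    }
}
```

Run it with the path to a .doc file as the only argument. This only covers the older binary .doc format. For .docx you would need POI's XWPF classes instead.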

Upvotes: 1
