Reputation: 313
I have been using Apache Tika to extract text from different document formats. Now I want it to handle headers, footers, and text boxes differently, so I downloaded the Tika source code from GitHub and am trying to modify it.
I want to run the Apache Tika source code from Eclipse and debug its execution by passing in an input document. How can I do that? There are so many main classes. Where do I start? I understand it's a Maven project, and I am new to Maven.
And once I have made changes, how can I build a new jar file?
Upvotes: 0
Views: 175
Reputation: 670
Take a look at Tika's XHTML output first. It may already mark headers and footers, in which case you can handle those parts through the parser API without modifying Tika's source. As the examples show, you do this by passing your own SAX ContentHandler to the parser.
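A minimal sketch of such a handler, using only the JDK's SAX classes so it runs without Tika on the classpath. It assumes the XHTML wraps these regions in `<div class="header">`, `<div class="footer">`, and `<div class="textbox">`; those class names are an assumption, so check them against the XHTML Tika actually produces for your documents. `RegionSplitter` and `split` are hypothetical names, and the sample markup is made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects text separately per region (header, footer, textbox, body). */
public class RegionSplitter extends DefaultHandler {
    // Assumed class names; verify against the real XHTML output.
    private static final List<String> MARKED = Arrays.asList("header", "footer", "textbox");

    private final Map<String, StringBuilder> regions = new LinkedHashMap<>();
    // One entry per open element, naming the region that element belongs to.
    private final Deque<String> stack = new ArrayDeque<>();

    private String currentRegion() {
        return stack.isEmpty() ? "body" : stack.peek();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String name = localName.isEmpty() ? qName : localName;
        String cls = atts.getValue("class");
        // A marked div switches the region; every other element inherits its parent's region.
        String region = ("div".equals(name) && cls != null && MARKED.contains(cls))
                ? cls : currentRegion();
        stack.push(region);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        stack.pop();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        regions.computeIfAbsent(currentRegion(), k -> new StringBuilder())
               .append(ch, start, length);
    }

    /** Returns the trimmed text collected for each region that was seen. */
    public Map<String, String> result() {
        Map<String, String> out = new LinkedHashMap<>();
        regions.forEach((k, v) -> out.put(k, v.toString().trim()));
        return out;
    }

    /** Parses an XHTML string with the JDK SAX parser and splits its text by region. */
    public static Map<String, String> split(String xhtml) {
        RegionSplitter handler = new RegionSplitter();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return handler.result();
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for Tika's XHTML output.
        String sample = "<html><body><div class=\"header\"><p>Annual Report</p></div>"
                + "<p>Main text.</p><div class=\"footer\"><p>Page 1</p></div></body></html>";
        split(sample).forEach((region, text) -> System.out.println(region + ": " + text));
    }
}
```

With Tika itself you would skip the `split` helper and hand an instance of the handler straight to the parser, for example `new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext())`, then read `handler.result()`. That keeps the header/footer logic in your own code, so you do not need to rebuild Tika.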
Upvotes: 1