Reputation: 1705
I have a PDF file that is 122 pages long. When I parse it using Tika
(version 1.17), the returned string doesn't contain the whole text.
I use the following simple code to get the text:
String content = new Tika().parseToString(file);
The text that I get with this code ends at around page 118. That is, the last few pages are missing.
Upvotes: 0
Views: 1347
Reputation: 48346
Promoting a comment to an answer...
By default, Apache Tika caps the amount of text it will let a parser generate (100,000 characters), to avoid accidentally swamping a user. In your case, it looks like you're hitting that limit when you really do want more!
As a user of the Tika facade helper class, you just need to call Tika.setMaxStringLength(int) with a higher limit, or with -1 to disable the limit entirely.
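For the facade case, a minimal sketch might look like the following. The class name FullTextWithFacade and its helper methods are made up for illustration; only new Tika(), setMaxStringLength(int) and parseToString(File) come from Tika itself, and the code assumes Tika and its parsers are on the classpath.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical wrapper class, named here only for the example.
public class FullTextWithFacade {

    // Builds a facade whose output length cap is switched off.
    static Tika unlimitedTika() {
        Tika tika = new Tika();
        tika.setMaxStringLength(-1); // -1 disables the limit
        return tika;
    }

    // Extracts the text of the given file without truncating it.
    static String extractAll(File file) throws IOException, TikaException {
        return unlimitedTika().parseToString(file);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: FullTextWithFacade <file>");
            return;
        }
        String content = extractAll(new File(args[0]));
        System.out.println("Extracted " + content.length() + " characters");
    }
}
```

With the limit disabled, the whole document text is held in memory as one string, so a large but finite limit may be the safer choice for untrusted input.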
If you're using the Tika parser classes directly, then you should pass a higher write limit (or -1) to your content handler, e.g. BodyContentHandler(int writeLimit).
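When using the parser classes directly, the same idea could be sketched roughly as below. The class name FullTextWithParser and the choice of AutoDetectParser are assumptions made for the example (any Tika parser would take the handler the same way), and Tika and its parsers are assumed to be on the classpath.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical wrapper class, named here only for the example.
public class FullTextWithParser {

    // Parses the file and returns its body text with no write limit applied.
    static String extractAll(Path path) throws Exception {
        // A write limit of -1 disables the handler's character cap.
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream stream = Files.newInputStream(path)) {
            new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: FullTextWithParser <file>");
            return;
        }
        String content = extractAll(Paths.get(args[0]));
        System.out.println("Extracted " + content.length() + " characters");
    }
}
```

To keep a cap but raise it, pass a positive number of characters to the BodyContentHandler constructor instead of -1.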
Upvotes: 3