Reputation: 4995
I know Tika has a very nice wrapper that lets me get a Reader back from parsing a file, like so:
Reader parsedReader = tika.parse(in);
However, if I use this, I cannot specify the parser or the metadata I want to pass in. For example, I would like to choose the handler, parser, and context myself, but this method does not allow it. As far as I know, it is the only one that lets me get a Reader instance back and read incrementally instead of getting the entire parsed string at once.
Example of things I want to include:
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, fileName); //This aids in the content detection
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
parser.parse(is, handler, metadata, context);
However, calling parse on a parser directly does not return a Reader, and the only option I have noticed in the docs is to return a fully parsed string, which might not be great for memory usage. I know I can limit the length of the returned string, but I want to avoid that: I want the fully parsed content, delivered incrementally. Is it possible to get the best of both worlds?
Upvotes: 2
Views: 557
Reputation: 48346
One of the many great things about Apache Tika is that it's open source, so you can see how it works. The facade you're using is the class org.apache.tika.Tika.
The part of that class that matters for you is this method:
public Reader parse(InputStream stream, Metadata metadata)
        throws IOException {
    ParseContext context = new ParseContext();
    context.set(Parser.class, parser);
    return new ParsingReader(parser, stream, metadata, context);
}
You can see there how Tika takes a parser and a stream and wraps them in a Reader. Do something similar with your own parser, metadata, and context, and you're set. Alternatively, write your own ContentHandler and pass it to the parser directly for full control!
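If you go the custom ContentHandler route, a rough sketch of the idea could look like the following. It uses only JDK classes so it runs without Tika on the classpath: the JDK SAX parser stands in for the Tika parser, and WriterHandler, ParseJob, and parseToReader are made-up names for illustration, not Tika API. The handler pushes text into a pipe from a background thread while the caller pulls from the Reader end, which, as far as I can tell from the source, is roughly the approach ParsingReader takes internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class StreamingHandlerSketch {

    /** Hypothetical handler: forwards each text event to a Writer as it arrives. */
    static class WriterHandler extends DefaultHandler {
        private final Writer out;

        WriterHandler(Writer out) {
            this.out = out;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                out.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    /** Hypothetical stand-in for "run some parser against this handler". */
    interface ParseJob {
        void run(DefaultHandler handler) throws Exception;
    }

    /** Runs the job on a background thread and returns the reading end of the pipe. */
    static Reader parseToReader(ParseJob job) throws IOException {
        PipedReader reader = new PipedReader();
        PipedWriter writer = new PipedWriter(reader);
        Thread worker = new Thread(() -> {
            // Closing the writer signals end-of-stream to the reading side.
            try (Writer w = writer) {
                job.run(new WriterHandler(w));
            } catch (Exception e) {
                // Sketch only: real code should surface this to the reader.
                e.printStackTrace();
            }
        }, "parse-worker");
        worker.setDaemon(true);
        worker.start();
        return reader;
    }

    /** Demo: the JDK SAX parser plays the role of the Tika parser here. */
    static String demo() throws IOException {
        byte[] xml = "<html><body><p>Hello, </p><p>Tika</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        Reader reader = parseToReader(handler -> {
            InputStream in = new ByteArrayInputStream(xml);
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        });
        // Read in small chunks to show the text is consumed incrementally.
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        reader.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo()); // prints "Hello, Tika"
    }
}
```

With Tika, the job passed to parseToReader would instead be something like handler -> parser.parse(is, handler, metadata, context), using the parser, metadata, and context from your question. If you don't need a custom handler, constructing new ParsingReader(parser, is, metadata, context) yourself, as the facade does, is the shorter path.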
Upvotes: 1