Reputation: 21
I'm trying to parse a RTF file using Apache Tika. Inside the file there is a table with several columns.
The problem is that the parser writes out the result without any information in which column the value was.
What I'm doing right now is:
AutoDetectParser adp = new AutoDetectParser(tc);
Metadata metadata = new Metadata();
String mimeType = new Tika().detect(file);
metadata.set(Metadata.CONTENT_TYPE, mimeType);
BodyContentHandler handler = new BodyContentHandler();
InputStream fis = new FileInputStream(file);
adp.parse(fis, handler, metadata, new ParseContext());
fis.close();
System.out.println(handler.toString());
It works but I need to know like meta-information.
Is there already a Handler which outputs something like HTML with a structure of the read RTF file?
Upvotes: 2
Views: 3559
Reputation: 48346
I would suggest that rather than asking Tika for the plain text version, then wondering where all your nice HTML information has gone, you instead just ask Tika for the document as XHTML. You'll then be able to process that to find the information you want on your RTF File
If you look at the Tika Examples or the Tika Unit Tests, you'll see this same pattern for an easy way to get the XHTML output
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));
parser.parse(input, handler, metadata, new ParseContext());
String xhtml = sw.toString();
Upvotes: 3