Rohit Shelhalkar

Reputation: 766

Apache Tika does not extract the first line of the RTF file; it only extracts the last three characters of the first line.

I have added the RTF file in a comment. Copy the following text into a text editor and save it in RTF format.

This is how the RTF file looks when opened in any RTF viewer.

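import java.io.File;
import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.rtf.RTFParser;
import org.apache.tika.sax.BodyContentHandler;

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("level1Missing.rtf"));
ParseContext pcontext = new ParseContext();
RTFParser rt = new RTFParser();
rt.parse(inputstream, handler, metadata, pcontext);
// Print the text content extracted from the document
System.out.println("Contents of the RTF :\n\n" + handler.toString());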

The output of the above code is:

Upvotes: 3

Views: 605

Answers (1)

Nicomedes E.

Reputation: 1334

In my view, the problem is not in Apache Tika but in the RTF file: a \par is missing before {\line {\b Level1} : \par}.

You can try it with this simpler file:

{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\par
This is some {\b bold} text.\par
}

If you remove the \par before This is some {\b bold} text.\par, Tika will extract only the last characters of the first line.
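To reproduce the comparison, one option is to generate both variants of the sample file and feed each one to the parsing code from the question. The sketch below uses only the Java standard library; the class name RtfVariants and the file names withPar.rtf and withoutPar.rtf are hypothetical, chosen just for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class RtfVariants {
    // Header and body copied from the sample file in this answer.
    static final String HEADER = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}\\f0";
    static final String BODY = "This is some {\\b bold} text.\\par\n}";

    // Variant that keeps the \par before the first line of text.
    static String withLeadingPar() {
        return HEADER + "\\par\n" + BODY;
    }

    // Variant with the leading \par removed, as described above.
    static String withoutLeadingPar() {
        return HEADER + "\n" + BODY;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output file names; any writable path works.
        Files.write(Paths.get("withPar.rtf"),
                withLeadingPar().getBytes(StandardCharsets.US_ASCII));
        Files.write(Paths.get("withoutPar.rtf"),
                withoutLeadingPar().getBytes(StandardCharsets.US_ASCII));
    }
}
```

Parsing the two generated files with the same RTFParser code and comparing the extracted text should show whether the missing \par is what causes the truncated first line.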

Upvotes: 4
