Apache Tika does not extract the first line of an RTF file; it only extracts the last three characters of the first line.

Question

I have added the RTF file in a comment. Copy that text into a text editor and save it in RTF format.

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File("level1Missing.rtf"));
ParseContext pcontext = new ParseContext();
RTFParser rt = new RTFParser();
rt.parse(inputstream, handler, metadata, pcontext);
// Print the text content extracted from the document
System.out.println("Contents of the RTF file:\n\n" + handler.toString());

Nicomedes E. · Accepted Answer

In my view, the problem is not in Apache Tika but in the RTF file: a \par is missing before {\line {\b Level1} : \par}.

You can try it with this simpler file:

{\rtf1\ansi{\fonttbl\f0\fswiss Helvetica;}\f0\par
This is some {\b bold} text.\par
}

If you remove the \par before This is some {\b bold} text.\par, Tika extracts only the last characters of the first line.
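
To compare the two cases side by side, a small helper can write both variants of the sample file to disk. This is only a sketch using the standard library: the class name RtfParSamples and the file names withPar.rtf and withoutPar.rtf are made up for illustration, and the code does not call Tika itself. Each generated file can then be passed to the parsing code from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: generates the sample RTF with and without the leading \par.
final class RtfParSamples {

    // The sample file from the answer, with \par right after the font setup.
    static final String WITH_LEADING_PAR =
            "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}\\f0\\par\n"
            + "This is some {\\b bold} text.\\par\n"
            + "}\n";

    // Returns a copy of the RTF source with the first "\par" removed.
    // Assumption: the first match is a real \par control word, which holds for
    // the sample above but not for arbitrary files (it would also match \pard).
    static String withoutLeadingPar(String rtf) {
        String par = "\\par";
        int index = rtf.indexOf(par);
        if (index < 0) {
            return rtf; // nothing to remove
        }
        return rtf.substring(0, index) + rtf.substring(index + par.length());
    }

    public static void main(String[] args) throws IOException {
        // Write both variants so each can be parsed and the output compared.
        Files.write(Paths.get("withPar.rtf"),
                WITH_LEADING_PAR.getBytes(StandardCharsets.US_ASCII));
        Files.write(Paths.get("withoutPar.rtf"),
                withoutLeadingPar(WITH_LEADING_PAR).getBytes(StandardCharsets.US_ASCII));
    }
}
```

Parsing withPar.rtf and then withoutPar.rtf with the same code makes it easy to see whether the truncated first line depends on the missing \par.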
