Reputation: 1600
I want to compare two PDF documents: not only their contents, but also other information such as headers, footers and styles.
I learned that Apache Tika can be used for this. So far I can parse a PDF document and extract some metadata, such as the title and author.
This is what I have so far:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class CompareDocs {

    private void parseResource(String resourceName) {
        System.out.println("Parsing resource : " + resourceName);
        // try-with-resources closes the stream automatically. If the file is
        // missing, the catch below reports it instead of a null stream
        // reaching the parser.
        try (InputStream inputStream = new BufferedInputStream(new FileInputStream(resourceName))) {
            Parser parser = new AutoDetectParser();
            // Note: the no-argument handler stops after 100,000 characters;
            // new BodyContentHandler(-1) lifts that limit for large files.
            ContentHandler contentHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            parser.parse(inputStream, contentHandler, metadata, new ParseContext());

            // Dump every metadata property Tika found in the document.
            for (String name : metadata.names()) {
                String value = metadata.get(name);
                System.out.println("Metadata Name: " + name);
                System.out.println("Metadata Value: " + value);
            }
            System.out.println("Title: " + metadata.get("title"));
            System.out.println("Author: " + metadata.get("Author"));
            System.out.println("content: " + contentHandler.toString());
        } catch (IOException | TikaException | SAXException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        CompareDocs apacheTikaParser = new CompareDocs();
        apacheTikaParser.parseResource("C:\\Users\\prakhar\\Desktop\\beautiful_code.pdf");
    }
}
How can I extract more information, such as the header distance of the first section or the height and width of images, and compare it with another PDF using Apache Tika?
Upvotes: 3
Views: 2380
Reputation: 788
If you need more detailed information, it may be more convenient to use another API such as PDFTextStream. Tika extracts the raw text of a PDF, while PDFTextStream gives you structured text with associated information such as character encoding, text height and the region the text occupies.
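Whichever library does the extraction, the comparison step itself can be sketched in plain Java. This is only a minimal sketch: it assumes the properties of each document have already been collected into a Map<String, String> (for example from the metadata.names() loop in the question). The class name MetadataDiff, the method diff and the sample values in main are made up for illustration and are not part of Tika or PDFTextStream.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical helper: compares two sets of already-extracted properties.
public class MetadataDiff {

    // Returns one entry per property whose value differs between the documents.
    // A property missing on one side is shown with the placeholder "<absent>".
    public static Map<String, String> diff(Map<String, String> first,
                                           Map<String, String> second) {
        Map<String, String> differences = new LinkedHashMap<>();
        // Union of both key sets, sorted so the report order is stable.
        TreeSet<String> names = new TreeSet<>(first.keySet());
        names.addAll(second.keySet());
        for (String name : names) {
            String a = first.get(name);
            String b = second.get(name);
            if (!Objects.equals(a, b)) {
                differences.put(name, describe(a) + " vs " + describe(b));
            }
        }
        return differences;
    }

    private static String describe(String value) {
        return value == null ? "<absent>" : value;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for two parsed documents.
        Map<String, String> first = new LinkedHashMap<>();
        first.put("title", "Sample Title");
        first.put("Author", "First Author");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("title", "Sample Title");
        second.put("Author", "Second Author");
        second.put("producer", "Some PDF Writer");

        // Print only the properties that differ.
        diff(first, second).forEach(
                (name, change) -> System.out.println(name + ": " + change));
    }
}
```

Layout details such as header position or image dimensions could be fed into the same comparison once they have been extracted as name/value pairs by whatever library exposes them.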
Upvotes: 1