unknown_boundaries

Reputation: 1600

How to compare two pdf documents using Apache Tika

I want to compare two PDF documents: not only their contents but also other information such as headers, footers, and styles.

I have read that Apache Tika can be used for this kind of comparison. So far I have learned how to parse a PDF document and can extract some metadata such as the title and author.

This is what I have so far:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class CompareDocs {

    private void parseResource(String resourceName) {
        System.out.println("Parsing resource : " + resourceName);

        // try-with-resources closes the stream even if parsing fails, and a
        // missing file is reported instead of passing a null stream to the parser.
        try (InputStream inputStream = new BufferedInputStream(new FileInputStream(resourceName))) {
            Parser parser = new AutoDetectParser();
            ContentHandler contentHandler = new BodyContentHandler();
            Metadata metadata = new Metadata();

            // Fills contentHandler with the body text and metadata with the document properties.
            parser.parse(inputStream, contentHandler, metadata, new ParseContext());

            // Print every metadata entry Tika found.
            for (String name : metadata.names()) {
                String value = metadata.get(name);
                System.out.println("Metadata Name: " + name);
                System.out.println("Metadata Value: " + value);
            }

            System.out.println("Title: " + metadata.get("title"));
            System.out.println("Author: " + metadata.get("Author"));
            System.out.println("content: " + contentHandler.toString());

        } catch (IOException | TikaException | SAXException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        CompareDocs apacheTikaParser = new CompareDocs();
        apacheTikaParser.parseResource("C:\\Users\\prakhar\\Desktop\\beautiful_code.pdf");
    }
}

How can I extract more information, such as the header distance of the first section or image height and width, and compare it with another PDF using Apache Tika?

Upvotes: 3

Views: 2380

Answers (2)

yeaaaahhhh..hamf hamf

Reputation: 788

If you need access to more information, it may be more convenient to use another API such as PDFTextStream. Tika extracts raw textual information from a PDF, while PDFTextStream gives you structured text with correlated information such as character encoding, height, and the region of the text.

Upvotes: 1

SANN3

Reputation: 10109

Tika detects and extracts metadata and structured text content. It does not support finding layout details such as the header distance of the first section or image height and width.

You can try PDFBox or iText.
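For the part Tika does cover (metadata and text), the comparison itself can be done in plain Java once the values are extracted. The following is only a minimal sketch, assuming the metadata of each document has already been copied into a `Map<String, String>`, for example from `metadata.names()` and `metadata.get(name)` as in the question. The class `MetadataDiff` and its `diff` method are hypothetical names, not part of Tika, and the sample values in `main` are made up.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Tika: reports metadata keys whose values differ.
public class MetadataDiff {

    // Returns key -> "left -> right" for every key that is missing on one side
    // or has a different value. Keys are sorted so the output order is stable.
    public static Map<String, String> diff(Map<String, String> left, Map<String, String> right) {
        TreeSet<String> keys = new TreeSet<>(left.keySet());
        keys.addAll(right.keySet());

        Map<String, String> differences = new TreeMap<>();
        for (String key : keys) {
            String l = left.get(key);   // null if the key is absent on the left
            String r = right.get(key);  // null if the key is absent on the right
            if (!Objects.equals(l, r)) {
                differences.put(key, l + " -> " + r);
            }
        }
        return differences;
    }

    public static void main(String[] args) {
        // Made-up sample values standing in for what metadata.get(name) would return.
        Map<String, String> first = Map.of("title", "Beautiful Code", "Author", "A");
        Map<String, String> second = Map.of("title", "Beautiful Code", "Author", "B");
        System.out.println(diff(first, second));
    }
}
```

The same idea applies to the extracted body text: compare the two `contentHandler.toString()` results, line by line if needed. Layout properties such as positions and image sizes would still have to come from a library like PDFBox or iText.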

Upvotes: 7
