Subbu

Reputation: 109

Retrieve a particular portion of data from a PDF

I need to retrieve the data associated with certain keywords from a PDF file. The keywords are: Title, Scope of pdf, who proposed that pdf, version, summary, state, regulator.

Is there a tool for retrieving data from a PDF? Thanks in advance.

Upvotes: 1

Views: 2265

Answers (3)

newuser

Reputation: 8466

Use Apache PDFBox:

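import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Written against the PDFBox 1.x API. In PDFBox 2.x and later the PDFParser
// constructor and the PDFTextStripper package are different.
public class PDFTextReader
{
    // Returns all text in the PDF, or null if the file is missing or cannot be parsed.
    static String pdftoText(String fileName) {
        PDFParser parser;
        String parsedText = null;
        PDFTextStripper pdfStripper = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File(fileName);
        if (!file.isFile()) {
            System.err.println("File " + fileName + " does not exist.");
            return null;
        }
        try {
            parser = new PDFParser(new FileInputStream(file));
        } catch (IOException e) {
            System.err.println("Unable to open PDF Parser. " + e.getMessage());
            return null;
        }
        try {
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            // Optional: set custom paragraph delimiters before extracting.
            // pdfStripper.setParagraphStart(FIND_START_VALUE);
            // pdfStripper.setParagraphEnd(FIND_END_VALUE);
            parsedText = pdfStripper.getText(pdDoc);
        } catch (Exception e) {
            System.err.println("An exception occurred in parsing the PDF Document. "
                    + e.getMessage());
        } finally {
            // Release the parser and document resources.
            try {
                if (cosDoc != null)
                    cosDoc.close();
                if (pdDoc != null)
                    pdDoc.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return parsedText;
    }

    public static void main(String[] args) {
        // The PDF path is taken from the first command-line argument.
        if (args.length != 1) {
            System.err.println("Usage: java PDFTextReader <path-to-pdf>");
            return;
        }
        System.out.println(pdftoText(args[0]));
    }
}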

I used this to extract the text. It may help you.

Upvotes: 0

saurav

Reputation: 3462

You can use PDFBox from Apache. Honestly, I have never used it myself, but I have read a lot about it on the forums.

Other alternatives are iText and JPedal.

If you are interested, you can give those a try, but I am confident that PDFBox will meet your requirements.

Thanks

Upvotes: 2

user784540

Consider Apache PDFBox

Extract the text from the PDF, then parse it to get the information you want. PDFBox is free.
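
Once the text has been extracted, the parsing step could look something like the sketch below. It is only an illustration and assumes that each field appears in the extracted text on its own line as "Keyword: value", which may not hold for your PDFs. KeywordExtractor and its extract method are hypothetical names and are not part of PDFBox; the code uses only the Java standard library.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox. Assumes every field sits on its
// own line in the extracted text, in the form "Keyword: value".
final class KeywordExtractor {

    // Returns the first value found for each keyword; keywords that do not
    // appear in the text are left out of the result.
    static Map<String, String> extract(String text, List<String> keywords) {
        Map<String, String> found = new LinkedHashMap<>();
        for (String keyword : keywords) {
            // "<keyword> : <value>" at the start of a line, ignoring case
            // and surrounding spaces or tabs.
            Pattern pattern = Pattern.compile(
                    "^[ \\t]*" + Pattern.quote(keyword) + "[ \\t]*:[ \\t]*(.+?)[ \\t]*$",
                    Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                found.put(keyword, matcher.group(1));
            }
        }
        return found;
    }
}
```

You would pass it the string returned by the text extraction together with the keywords from the question, for example KeywordExtractor.extract(text, List.of("Title", "version", "summary", "state", "regulator")). If the values in your PDFs span several lines or are not introduced by a colon, the pattern will need adjusting.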

Another tool is iText, but if you are working on a commercial project you need to buy an iText license.

Upvotes: 0
