Text extraction from PDF using PDFBox 2.0

Question

I'm trying to use PDFBox 2.0 for text extraction. I would like to get information on the font size of specific characters and the position rectangle of that character on the page. I've implemented this in PDFBox 1.6 using a PDFTextStripper:

    PDFParser parser = new PDFParser(is);
    try{
        parser.parse();
    }catch(IOException e){

    }
    COSDocument cosDoc = parser.getDocument();
    PDDocument pdd = new PDDocument(cosDoc);
    final StringBuffer extractedText = new StringBuffer();
    PDFTextStripper textStripper = new PDFTextStripper(){
        @Override
        protected void processTextPosition(TextPosition text) {
            extractedText.append(text.getCharacter());
            logger.debug("text position: "+text.toString());
        }
    };
    textStripper.setSuppressDuplicateOverlappingText(false);
    for(int pageNum = 0;pageNum



But in the 2.0 version of PDFBox, the processStream method has been removed.
How can I achieve the same with PDFBox 2.0?

I've tried the following:

        PDDocument pdd = PDDocument.load(inputStream);
        PDFTextStripper textStripper = new PDFTextStripper(){
            @Override
            protected void processTextPosition(TextPosition text){
                int pos = PDFdocument.length();
                String textadded = text.getUnicode();
                Range range = new Range(pos,pos+textadded.length());
                int pagenr = this.getCurrentPageNo();
                Rectangle2D rect = new Rectangle2D.Float(text.getX(),text.getY(),text.getWidth(),text.getHeight());
            }
        };
        textStripper.setSuppressDuplicateOverlappingText(false);
        for(int pageNum = 0;pageNum


The processTextPosition(TextPosition text) method does not get called.
Any suggestions would be very welcome.

Dieudonné · Accepted Answer

The DrawPrintTextLocations example, suggested by @tilmanhausherr, provided the solution to my problem.

The parser is started using the following code (the inputStream is the input stream from the URL of the PDF file):

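    PDDocument pdd = null;
    try {
        pdd = PDDocument.load(inputStream);
        PDFParserTextStripper stripper = new PDFParserTextStripper(pdd);
        stripper.setSortByPosition(true);
        for (int i = 0; i < pdd.getNumberOfPages(); i++) {
            stripper.stripPage(i); // zero-based page index
        }
    } finally {
        if (pdd != null) {
            pdd.close();
        }
    }

This code uses a custom subclass of PDFTextStripper:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    class PDFParserTextStripper extends PDFTextStripper {

        public PDFParserTextStripper(PDDocument pdd) throws IOException {
            super();
            // Store the document in the inherited field so stripPage can pass it to writeText.
            document = pdd;
        }

        public void stripPage(int pageNr) throws IOException {
            // setStartPage/setEndPage are 1-based, pageNr is 0-based.
            this.setStartPage(pageNr + 1);
            this.setEndPage(pageNr + 1);
            // The written text is discarded; only the writeString callbacks matter here.
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            writeText(document, dummy); // This call starts the parsing process and calls writeString repeatedly.
        }

        @Override
        protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
            // Each TextPosition describes one glyph: position, font size and Unicode value.
            for (TextPosition text : textPositions) {
                System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                        + " fs=" + text.getFontSizeInPt() + " xscale=" + text.getXScale()
                        + " height=" + text.getHeightDir() + " space=" + text.getWidthOfSpace()
                        + " width=" + text.getWidthDirAdj() + " ] " + text.getUnicode());
            }
        }

    }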
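The question also asks for a position rectangle and a character range per glyph, which the println above does not build. Here is a minimal sketch of how the values printed in writeString could be collected instead. It uses only the JDK so it can be run without PDFBox; GlyphCollector and its methods are hypothetical names, not part of PDFBox. It assumes the arguments come from getUnicode(), getXDirAdj(), getYDirAdj(), getWidthDirAdj(), getHeightDir() and getFontSizeInPt(), and that getYDirAdj() is the baseline measured from the top of the page, so the top edge of the box is yDirAdj minus the height (this is how I read the DrawPrintTextLocations example).

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: collects per-glyph geometry alongside the extracted text. */
class GlyphCollector {

    /** One glyph: its character range in the text, its font size and its box. */
    static final class Glyph {
        final int start;                 // inclusive offset into the extracted text
        final int end;                   // exclusive offset
        final float fontSizeInPt;
        final Rectangle2D.Float box;

        Glyph(int start, int end, float fontSizeInPt, Rectangle2D.Float box) {
            this.start = start;
            this.end = end;
            this.fontSizeInPt = fontSizeInPt;
            this.box = box;
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<Glyph> glyphs = new ArrayList<>();

    /**
     * Records one glyph. Assumption: yDirAdj is the baseline measured from the
     * top of the page, so the top edge of the box is yDirAdj - height.
     */
    void add(String unicode, float xDirAdj, float yDirAdj,
             float width, float height, float fontSizeInPt) {
        int start = text.length();
        text.append(unicode);
        Rectangle2D.Float box =
                new Rectangle2D.Float(xDirAdj, yDirAdj - height, width, height);
        glyphs.add(new Glyph(start, text.length(), fontSizeInPt, box));
    }

    String text() {
        return text.toString();
    }

    List<Glyph> glyphs() {
        return glyphs;
    }

    /** Union of the boxes of all glyphs overlapping [from, to); null if there are none. */
    Rectangle2D bounds(int from, int to) {
        Rectangle2D result = null;
        for (Glyph g : glyphs) {
            if (g.start < to && g.end > from) {
                result = (result == null) ? g.box.getBounds2D() : result.createUnion(g.box);
            }
        }
        return result;
    }
}
```

Inside writeString you would call collector.add(...) once per TextPosition instead of printing, then use bounds(from, to) to get the rectangle covering a word or any other character range.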
