How to extract data from a PDF and split it into particular categories using Java

Question

I am trying to extract data from a PDF and split it into categories. I can already extract the text and assign it to categories based on font size. For example, say there are three categories: country, capital and city. I am able to put all countries, capitals and cities into their respective categories, but I cannot tell which capital belongs to which city and country, or which country belongs to which city and capital. *The data seems to be read in random order. How can I read it from bottom to top without breaking the sequence, so that I can put the first word into the first category, the second into the second, and so on?*

Or does anyone know a more efficient way to put the text into its respective categories and map it as well?

I am using Java, and here is my code:

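import java.io.IOException;

// iText 5 classes
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

public class ReadPdfText {

    public static void main(String[] args) {
        String src = "pdffile.pdf";
        PdfReader reader = null;
        try {
            reader = new PdfReader(src);
            // One strategy instance is shared by all pages, so its state carries over.
            SemTextExtractionStrategy smt = new SemTextExtractionStrategy();
            for (int i = 1; i <= reader.getNumberOfPages(); i++) {
                PdfTextExtractor.getTextFromPage(reader, i, smt);
            }
        } catch (IOException e) {
            // Report the failure instead of silently swallowing it.
            e.printStackTrace();
        } finally {
            if (reader != null) {
                reader.close();
            }
        }
    }
}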

SemTextExtractionStrategy class:

// iText 5 classes
import com.itextpdf.text.Rectangle;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;
import com.itextpdf.text.pdf.parser.Vector;

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    // Text of the chunk most recently passed to renderText().
    private String text;
    // Text collected for the current run of chunks that share one height.
    private final StringBuilder str = new StringBuilder();
    // Height of the current run, used as a stand-in for the font size.
    private float temp = 0;

    @Override
    public void beginTextBlock() {
    }

    @Override
    public void renderText(TextRenderInfo renderInfo) {
        text = renderInfo.getText();

        // Box from the start of the baseline to the end of the ascent line.
        Vector curBaseline = renderInfo.getBaseline().getStartPoint();
        Vector topRight = renderInfo.getAscentLine().getEndPoint();
        Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1),
                topRight.get(0), topRight.get(1));
        // The box height approximates the font size of this chunk.
        float curFontSize = rect.getHeight();

        compare(text, curFontSize);
    }

    public void compare(String text2, float curFontSize) {
        if (temp == curFontSize) {
            // Same height as the previous chunk: keep extending the current run.
            str.append(text2);
        } else {
            // Height changed: print the finished run under its category.
            if (temp > 9.8 && temp < 10) {
                String country = str.toString();
                System.out.println("Country: " + country);
            } else if (temp > 8 && temp < 9) {
                String itemPrice = str.toString();
                System.out.println("itemPrice: " + itemPrice);
            } else if (temp > 7 && temp < 7.2) {
                String capital = str.toString();
                System.out.println("capital: " + capital);
            } else if (temp > 7.2 && temp < 8) {
                String city = str.toString();
                System.out.println("city: " + city);
            } else {
                System.out.println("size: " + temp + "   " + "str: " + str);
            }
            // Start a new run with the current chunk.
            temp = curFontSize;
            str.setLength(0);
            str.append(text2);
        }
    }

    @Override
    public void endTextBlock() {
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo) {
    }

    @Override
    public String getResultantText() {
        // Returns only the last chunk seen; the categories are printed in compare().
        // Note: the final run is still buffered when extraction ends and is never printed.
        return text;
    }
}
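
One possible way to keep the mapping is to stop printing each run as it arrives and instead record every chunk together with its position, sort the chunks into reading order, and open a new record whenever a country-sized chunk appears. The sketch below shows only that grouping step in plain Java, without iText, so it can be run on its own. `PlaceMapper`, `Chunk`, `Place`, `Category`, `categorize` and `group` are hypothetical names, the size ranges are copied from the `compare` method above, and it assumes each country is followed on the page by its capital and cities when read from top to bottom.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PlaceMapper {

    /** One piece of text with its page position and measured height (hypothetical type). */
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        final float size;

        Chunk(String text, float x, float y, float size) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.size = size;
        }
    }

    /** A country together with the capital and cities that follow it (hypothetical type). */
    static final class Place {
        final String country;
        String capital = "";
        final List<String> cities = new ArrayList<>();

        Place(String country) {
            this.country = country;
        }

        @Override
        public String toString() {
            return country + " / " + capital + " / " + cities;
        }
    }

    enum Category { COUNTRY, CAPITAL, CITY, OTHER }

    // Size ranges taken from the question's compare() method; adjust them to the real PDF.
    static Category categorize(float size) {
        if (size > 9.8f && size < 10f) return Category.COUNTRY;
        if (size > 7f && size < 7.2f) return Category.CAPITAL;
        if (size > 7.2f && size < 8f) return Category.CITY;
        return Category.OTHER;
    }

    static List<Place> group(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upward, so top-to-bottom means y descending, then x ascending.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed()
                .thenComparingDouble(c -> c.x));

        List<Place> places = new ArrayList<>();
        Place current = null;
        for (Chunk c : sorted) {
            switch (categorize(c.size)) {
                case COUNTRY:
                    // Assumption: a country-sized chunk starts a new record.
                    current = new Place(c.text);
                    places.add(current);
                    break;
                case CAPITAL:
                    if (current != null) current.capital = c.text;
                    break;
                case CITY:
                    if (current != null) current.cities.add(c.text);
                    break;
                default:
                    break; // Sizes outside the known ranges are ignored.
            }
        }
        return places;
    }

    public static void main(String[] args) {
        // Chunks deliberately out of order, as they might arrive from the PDF.
        List<Chunk> chunks = Arrays.asList(
                new Chunk("Paris", 50f, 680f, 7.1f),
                new Chunk("France", 50f, 700f, 9.9f),
                new Chunk("Lyon", 50f, 660f, 7.5f));
        for (Place p : group(chunks)) {
            System.out.println(p);
        }
    }
}
```

In the extraction strategy, `renderText` could create one `Chunk` per call from the baseline start point and the measured height, and `group` could be called once all pages have been processed. If the PDF lists its entries in the opposite direction, reversing the sort should be enough.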
