Peter Holdensgaard

Reputation: 57

Search pattern within String in JAVA

I'm using PDFBox in Java and have successfully extracted the text of a PDF. Now I want to search that text for a specific word and retrieve only the number that follows it. Concretely, I want to search for "Tax" and retrieve the tax amount that comes after it. The two strings appear to be separated by a tab.

My code currently looks like this:

File file = new File("yes.pdf");
// try-with-resources closes the document even if text extraction fails
try (PDDocument document = PDDocument.load(file)) {
    PDFTextStripper pdfStripper = new PDFTextStripper();

    String text = pdfStripper.getText(document);

    System.out.println(text);

    // search for the word "Tax"
    // retrieve the number after the word "Tax"
} catch (IOException e) {
    e.printStackTrace();
}

Upvotes: 0

Views: 545

Answers (2)

Mori Manish

Reputation: 179

I have used something similar in my project. I hope it helps.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper; // package name as in PDFBox 2.x

public class ExtractNumber {

    public static void main(String[] args) throws IOException {
        List<String> digitList = new ArrayList<String>();

        // A letter immediately followed by one or more digits
        Pattern mainPattern = Pattern.compile("[a-zA-Z]\\d+");
        // Digits only; compiled once rather than on every loop iteration
        Pattern subPattern = Pattern.compile("\\d+");

        // try-with-resources closes the document even if extraction fails
        try (PDDocument doc = PDDocument.load(new File("yourFile location"))) {
            PDFTextStripper stripper = new PDFTextStripper();

            // Read text from the PDF
            String string = stripper.getText(doc);

            Matcher mainMatcher = mainPattern.matcher(string);
            while (mainMatcher.find()) {
                // Keep only the digits of each match
                Matcher subMatcher = subPattern.matcher(mainMatcher.group());
                if (subMatcher.find()) {
                    digitList.add(subMatcher.group());
                }
            }
        }

        for (String digit : digitList) {
            System.out.println(digit);
        }
    }
}

The regular expression [a-zA-Z]\d+ matches a letter immediately followed by one or more digits in the PDF text.

The \d+ expression then extracts just the digits from each of those matches.

You can also use a different regular expression to match numbers with a specific number of digits.

You can learn more from this tutorial.
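To illustrate the point about matching a specific number of digits, here is a minimal sketch that uses the \d{n} quantifier. The class name FixedDigits, the method findNumbersWithDigits and the sample text are made up for this example and are not part of PDFBox; in practice the text would come from PDFTextStripper.getText.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedDigits {

    // Hypothetical helper: returns every standalone number in the text
    // that has exactly `digits` digits.
    static List<String> findNumbersWithDigits(String text, int digits) {
        // \b on both sides keeps the match from being part of a longer number
        Pattern pattern = Pattern.compile("\\b\\d{" + digits + "}\\b");
        Matcher matcher = pattern.matcher(text);
        List<String> result = new ArrayList<>();
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the extracted PDF text
        String text = "Invoice 2023 total 12345 tax 0456";
        System.out.println(findNumbersWithDigits(text, 4));
    }
}
```

With the sample text above, only 2023 and 0456 are returned; 12345 is skipped because the word boundaries prevent a four-digit match inside a longer number.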

Upvotes: 3

Lazar Petrovic

Reputation: 537

The best way to do this is with regular expressions. I often use this tool to write and test mine. Your regex should probably look something like tax\s([0-9]+), where the capturing group in parentheses holds the number. You can take a look at this tutorial on how to use regex in Java.
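As a rough sketch of how that regex could be applied in Java: the class TaxExtractor, the method findTax and the sample text are hypothetical, and the pattern is adjusted under two assumptions, namely that the word may be capitalised as "Tax" (hence CASE_INSENSITIVE) and that more than one whitespace character may separate it from the number (hence \s+). It only handles whole numbers; a decimal amount would need a wider character class.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaxExtractor {

    // "tax" as a whole-word start, whitespace (a tab counts), then the digits.
    // Group 1 captures only the number.
    private static final Pattern TAX_PATTERN =
            Pattern.compile("\\btax\\s+([0-9]+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the first number following "tax", if any.
    static Optional<String> findTax(String text) {
        Matcher matcher = TAX_PATTERN.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample standing in for pdfStripper.getText(document)
        String text = "Subtotal\t100\nTax\t25\nTotal\t125";
        System.out.println(findTax(text).orElse("not found")); // prints 25
    }
}
```

Returning an Optional makes the "no match" case explicit, so the caller does not have to guard against calling group on a failed match.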

Upvotes: 2
