Reputation: 73
I am attempting to parse fields from a PDF file converted to text with PDFBox. Here is an example of a field I need to extract: "BUYER NAME AND ADDRESS:". These documents often contain translations, so the colon appears a variable number of characters after BUYER NAME AND ADDRESS. Example below.
Txt file..
BUYER NAME AND ADDRESS / NOMBRE Y
DIRECCIÓN DEL COMPRADOR:
Name of buyer here
Txt continues..
Here is my attempted pattern / scanning code.
Scanner sc = new Scanner(txtFile);
Pattern p = Pattern.compile("BUYER NAME AND ADDRESS.*:", Pattern.MULTILINE);
sc.findWithinHorizon(p, 0);
String buyer = sc.nextLine();
buyer = sc.nextLine();
System.out.println("Buyer Name: "+buyer);
This works when the text file is English only, e.g. BUYER NAME AND ADDRESS:, but it fails if there are additional characters or line breaks before the colon. How can I fix the pattern?
Upvotes: 1
Views: 291
Reputation: 8743
The given regex "BUYER NAME AND ADDRESS.*:" matches "BUYER NAME AND ADDRESS" followed by any number of characters followed by a colon. By default . does not match line terminators, so the pattern fails when the colon is on a later line than the label. To fix that, change MULTILINE (^ and $ match at the start and end of each line) to DOTALL (. also matches newlines), as @stribizhev said. Because quantifiers are greedy, .* would then match everything up to the last colon in the file, so use .*? (non-greedy) to stop at the first colon.
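As a rough sketch of the DOTALL plus non-greedy approach, the following reuses the Scanner calls from the question. The class name BuyerFieldDotall, the helper extractBuyer, and the inline sample string are made up for illustration, and the code assumes the buyer name is always on the line directly after the colon.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class BuyerFieldDotall {
    // DOTALL lets '.' cross line breaks; the reluctant '.*?' stops at the first colon.
    static final Pattern LABEL =
            Pattern.compile("BUYER NAME AND ADDRESS.*?:", Pattern.DOTALL);

    // Hypothetical helper: returns the line after the label, or null if nothing is found.
    static String extractBuyer(String text) {
        try (Scanner sc = new Scanner(text)) {
            // A horizon of 0 means "search the whole input".
            if (sc.findWithinHorizon(LABEL, 0) == null || !sc.hasNextLine()) {
                return null;
            }
            sc.nextLine(); // consume the remainder of the label line after the colon
            // Assumption: the buyer name is on the very next line.
            return sc.hasNextLine() ? sc.nextLine().trim() : null;
        }
    }

    public static void main(String[] args) {
        // \u00d3 is the accented O in DIRECCION, escaped to avoid source-encoding issues.
        String sample = "BUYER NAME AND ADDRESS / NOMBRE Y\n"
                + "DIRECCI\u00d3N DEL COMPRADOR:\n"
                + "Name of buyer here\n"
                + "Txt continues..\n";
        System.out.println("Buyer Name: " + extractBuyer(sample));
    }
}
```

To read from the converted file instead, the Scanner could be constructed from the file as in the question.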
The pattern can also be corrected by using [^:], where [^...] means "none of those characters". This way you don't need any modifiers (I removed the : at the end because you probably don't need it if you do it like this):
"BUYER NAME AND ADDRESS[^:]*"
Upvotes: 1
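For the negated character class variant, a minimal sketch might look like the following. The class name BuyerFieldNegatedClass and the helper extractBuyer are hypothetical, and the code again assumes the buyer name sits on the line directly after the colon. Since the pattern stops just before the colon, the first nextLine() call consumes the leftover ":" rather than an empty string.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

public class BuyerFieldNegatedClass {
    // [^:]* matches any run of non-colon characters, line breaks included,
    // so neither DOTALL nor MULTILINE is needed.
    static final Pattern LABEL = Pattern.compile("BUYER NAME AND ADDRESS[^:]*");

    // Hypothetical helper: returns the line after the label, or null if nothing is found.
    static String extractBuyer(String text) {
        try (Scanner sc = new Scanner(text)) {
            if (sc.findWithinHorizon(LABEL, 0) == null || !sc.hasNextLine()) {
                return null;
            }
            sc.nextLine(); // consumes the ":" that the pattern left unmatched
            // Assumption: the buyer name is on the very next line.
            return sc.hasNextLine() ? sc.nextLine().trim() : null;
        }
    }

    public static void main(String[] args) {
        String sample = "BUYER NAME AND ADDRESS / NOMBRE Y\n"
                + "DIRECCI\u00d3N DEL COMPRADOR:\n"
                + "Name of buyer here\n";
        System.out.println("Buyer Name: " + extractBuyer(sample));
    }
}
```

One caveat with this variant: if the text after the label contains no colon at all, [^:]* runs to the end of the input and the helper returns null.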