Kurter21

Reputation: 73

Parse line of text after multiline regex pattern

I am trying to parse fields from a PDF file that was converted to text with PDFBox. One field I need to extract is "BUYER NAME AND ADDRESS:". These documents often contain translations, so the colon can appear a variable number of characters, sometimes on a following line, after BUYER NAME AND ADDRESS. Example below.

Txt file..
BUYER NAME AND ADDRESS / NOMBRE Y
DIRECCIÓN DEL COMPRADOR:
Name of buyer here
Txt continues..

Here is my attempted pattern / scanning code.

Scanner sc = new Scanner(txtFile);
Pattern p = Pattern.compile("BUYER NAME AND ADDRESS.*:", Pattern.MULTILINE);
sc.findWithinHorizon(p, 0);
String buyer = sc.nextLine();
buyer = sc.nextLine();
System.out.println("Buyer Name: "+buyer);

This works when the text file is English only, e.g. BUYER NAME AND ADDRESS:, but it fails if there are additional characters or line breaks before the colon. How can I fix the pattern?

Upvotes: 1

Views: 291

Answers (1)

maraca

Reputation: 8743

The given regex "BUYER NAME AND ADDRESS.*:" matches "BUYER NAME AND ADDRESS" followed by any number of characters followed by a colon. Quantifiers are greedy, so .* runs to the last colon it can reach; use .*? (non-greedy) to stop at the first colon instead. Additionally, MULTILINE only makes ^ and $ match at the start and end of each line. It does not let . cross a line break, which is why the pattern fails when the colon is on a later line. As @stribizhev said, you need DOTALL (. also matches newlines) to make this work.

This can also be fixed with a negated character class: [^...] matches any character that is not listed, so [^:]* consumes everything up to the first colon, line breaks included. This way you don't need any modifiers (I removed the : at the end because you probably don't need it if you do it like this):

"BUYER NAME AND ADDRESS[^:]*"

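To make the two fixes concrete, here is a minimal sketch that runs both patterns against the sample text from the question. The class name BuyerFieldDemo and the helper extractBuyer are made up for illustration, and the input is an in-memory String rather than the asker's txtFile.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative, not from the question.
class BuyerFieldDemo {
    // Negated character class: consumes everything up to (not including) the
    // first colon, line breaks included, so no flags are needed.
    static final Pattern NEGATED = Pattern.compile("BUYER NAME AND ADDRESS[^:]*");

    // Lazy quantifier plus DOTALL: '.' may cross line breaks and '*?' stops
    // at the first colon.
    static final Pattern LAZY = Pattern.compile("BUYER NAME AND ADDRESS.*?:", Pattern.DOTALL);

    // Returns the line following the label line, or null if it cannot be found.
    static String extractBuyer(String text, Pattern label) {
        try (Scanner sc = new Scanner(text)) {
            if (sc.findWithinHorizon(label, 0) == null) {
                return null; // label not present
            }
            if (!sc.hasNextLine()) {
                return null; // nothing after the label
            }
            sc.nextLine(); // discard the rest of the label line (":" or "")
            return sc.hasNextLine() ? sc.nextLine() : null;
        }
    }

    public static void main(String[] args) {
        // The accented O is written as a Unicode escape to avoid
        // source-encoding issues when compiling.
        String text = "BUYER NAME AND ADDRESS / NOMBRE Y\n"
                + "DIRECCI\u00d3N DEL COMPRADOR:\n"
                + "Name of buyer here\n"
                + "Txt continues..\n";
        System.out.println("Buyer Name: " + extractBuyer(text, NEGATED));
        System.out.println("Buyer Name: " + extractBuyer(text, LAZY));
    }
}
```

Both calls should print "Buyer Name: Name of buyer here". Either way, one nextLine() call is still needed to step past the remainder of the label line before reading the buyer's name, which matches the two nextLine() calls in the question's code.
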
Upvotes: 1
