SimpleProgrammer

Reputation: 259

Constructing regex in Java with variable number of certain characters in pattern

I am given a text file that should follow a format known in advance. I would like to verify this by reading each line of the file and matching it against a regex. The first line of each text file has the following format:

  1. First character is "O" (capital o)
  2. Characters 2-16 are digits, with the exception of the 6th character, which is a blank space
  3. Characters 17-30 are a decimal number, where character 28 is the decimal point
  4. Characters 31-40 are an integer
  5. ...

The specification continues, but I only need help with steps 3 and 4. For instance, the decimal number could be 1000.55, in which case it is preceded by 7 blank spaces in the text file so that it fills the 14-character field. The same goes for step 4: if the number is 10, it is preceded by 8 blank spaces so that it fills the 10-character field.

How can I construct a regex that detects this pattern? Since the number of blank spaces may change, I am not sure. My idea was something like this:

String regex = "O[0-9]{4} [0-9]{10}[ ]*[0-9]*\\.[0-9]{2}";

The first letter is "O", followed by four digits, a blank space, and 10 digits. Then comes an unspecified number of blank spaces followed by an unspecified number of digits, and finally a decimal point and two digits. However, this does not restrict the decimal number to exactly 14 characters, so I do not think it will work.

Upvotes: 0

Views: 657

Answers (1)

The fourth bird

Reputation: 163362

You could match the first part directly, since you know the number of occurrences of each character there.

For steps 3 and 4 you could use positive lookaheads to assert the number of occurrences.

In Java you could also use \h to match a horizontal whitespace character (a space or a tab, for example) instead of a literal space.

^O\d{4} \d{10}(?=[ \d]{11}\.) *\d*\.\d\d(?=[ \d]{10}) {0,9}\d+

In Java with the doubled backslashes:

String regex = "^O\\d{4} \\d{10}(?=[ \\d]{11}\\.) *\\d*\\.\\d\\d(?=[ \\d]{10}) {0,9}\\d+";
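  • ^O Match O at the start of the string
  • \d{4} \d{10} Match 4 digits, a space and 10 digits
  • (?=[ \d]{11}\.) Positive lookahead, assert 11 occurrences of either a space or digit followed by a dot to the right of the current position
  • *\d*\.\d\d Match optional spaces, optional digits, a dot and 2 digits (the digits before the dot are optional so that a value like .22 also matches)
  • (?=[ \d]{10}) Positive lookahead, assert 10 occurrences of either a space or digit to the right of the current position
  • {0,9}\d+ Match 0-9 spaces and 1+ digits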
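
As a rough illustration of how the pattern above could be applied, here is a minimal sketch. The class name FirstLineCheck and the helpers looksValid and sample are made up for this example, and the sample values are the ones from the question (1000.55 and 10). The test lines are built with String.format so the padding does not have to be counted by hand.

```java
import java.util.regex.Pattern;

class FirstLineCheck {
    // Pattern from the answer; backslashes are doubled for the Java string literal.
    static final Pattern FIRST_LINE = Pattern.compile(
        "^O\\d{4} \\d{10}(?=[ \\d]{11}\\.) *\\d*\\.\\d\\d(?=[ \\d]{10}) {0,9}\\d+");

    // Hypothetical helper: true if the start of the line matches the pattern.
    static boolean looksValid(String line) {
        return FIRST_LINE.matcher(line).find();
    }

    // Hypothetical helper: builds a sample line, right-aligning the decimal
    // in a 14-character field and the integer in a 10-character field.
    static String sample(String decimal, String integer) {
        return String.format("O1234 1234567890%14s%10s", decimal, integer);
    }

    public static void main(String[] args) {
        System.out.println(looksValid(sample("1000.55", "10"))); // true
        System.out.println(looksValid(sample("100.555", "10"))); // false, the dot is not at character 28
    }
}
```

Because the pattern has no end anchor, find() only checks the beginning of the line. Whatever follows the digits of the integer is not inspected and is left to the rest of the specification.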

Regex demo

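If the whole string is exactly 40 characters long, you can assert that length once, directly after the O, with (?=[\d .]{39}$) and anchor the pattern with $. The integer field then no longer needs its own lookahead, so (?=[ \d]{11}\.) is the only other lookahead required: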

^O(?=[\d .]{39}$)\d{4} \d{10}(?=[ \d]{11}\.) *\d*\.\d\d *\d+$
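
A minimal sketch of the 40-character variant in Java, assuming each line is checked as a whole with matches(). The class name ExactLengthCheck and the helpers isValid and sample are hypothetical names used only for this illustration.

```java
import java.util.regex.Pattern;

class ExactLengthCheck {
    // 40-character variant from the answer, with doubled backslashes.
    static final Pattern LINE_40 = Pattern.compile(
        "^O(?=[\\d .]{39}$)\\d{4} \\d{10}(?=[ \\d]{11}\\.) *\\d*\\.\\d\\d *\\d+$");

    // Hypothetical helper: true only if the entire line matches the pattern.
    static boolean isValid(String line) {
        return LINE_40.matcher(line).matches();
    }

    // Hypothetical helper: builds a sample line from two format specifiers,
    // e.g. "%14s" (right-aligned) or "%-10s" (left-aligned).
    static String sample(String decimalFormat, String decimal,
                         String integerFormat, String integer) {
        return "O1234 1234567890"
            + String.format(decimalFormat, decimal)
            + String.format(integerFormat, integer);
    }

    public static void main(String[] args) {
        String good = sample("%14s", "1000.55", "%10s", "10");
        System.out.println(isValid(good));                                       // true
        System.out.println(isValid(good + "7"));                                 // false, 41 characters
        System.out.println(isValid(sample("%14s", "1000.55", "%-10s", "10")));   // false, integer is not right-aligned
    }
}
```

With the total length fixed, the trailing " *\d+$" can only succeed when the integer is right-aligned in the last 10 characters, which is why the separate lookahead for that field can be dropped.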

Regex demo

Upvotes: 1
