Reputation: 1228
I am trying to use this regex
(\d{4}-\d{2}-\d{2})\s(\d{2}\:\d{2}\:\d{2})\.\d\s([A-Z]{3})((\d{1,3}\.){3}\d{1,3}(?:(:\d{1,5})?))\s?(.*)
to match a string such as
2021-05-20 11:03:00.0 GMT222.111.222.33
and
2021-05-20 11:03:00.0 GMT222.111.222.33:2323
In the case of this string:
2021-05-20 11:03:00.0 GMT222.111.222.33:2323444 first second
the port number has more than 5 digits, and the extra digits spill into the next capture group, which becomes
44 first second
instead of the whole string failing to match. The port number is optional (when present, it should be captured together with the IP group).
The text after the port number is optional as well, but it should be matched when present.
There is usually a space between the IP (and port) and the text after it, so valid input should look like:
2021-05-20 11:03:00.0 GMT222.111.222.33
2021-05-20 11:03:00.0 GMT222.111.222.33 first second
2021-05-20 11:03:00.0 GMT222.111.222.33:23232
2021-05-20 11:03:00.0 GMT222.111.222.33:23232 first second
Examples of strings that should not match:
2021-05-20 11:03:00.0 GMT222.111.222.33: first second
- If the colon is present but the port is absent or has more than 5 digits.
2021-05-20 11:03:00.0 GMT222.111.222.33first second
or
2021-05-20 11:03:00.0 GMT222.111.222.33:23232first second
How may I fix this?
Upvotes: 0
Views: 97
Reputation: 163207
You get a partial match because there is no clear boundary after the port, and both parts of the trailing \s?(.*) are optional.
The .* can match any character, so any digits left over after :\d{1,5} has matched end up in that last group.
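To see that behaviour concretely, here is a small sketch using Python's re module (the choice of Python is an assumption; the question does not name a regex flavour). It runs the question's pattern unchanged against the over-long port example and reads back the IP group and the trailing group.

```python
import re

# The pattern from the question, unchanged (escaped colons included).
original = re.compile(
    r"(\d{4}-\d{2}-\d{2})\s(\d{2}\:\d{2}\:\d{2})\.\d\s([A-Z]{3})"
    r"((\d{1,3}\.){3}\d{1,3}(?:(:\d{1,5})?))\s?(.*)"
)

line = "2021-05-20 11:03:00.0 GMT222.111.222.33:2323444 first second"
m = original.match(line)

# Group 4 is the IP plus optional port; group 7 is the trailing (.*).
ip_and_port = m.group(4)
rest = m.group(7)

print(ip_and_port)  # 222.111.222.33:23234
print(rest)         # 44 first second
```

The port group stops after five digits, \s? matches nothing, and .* takes everything that is left, including the two surplus digits.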
Also note that you don't have to escape the colon (\: can be written as :).
You might use word boundaries and, at the end, a negative lookahead \b(?!:\d) to assert that what follows directly to the right is not : and a digit:
\b(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\.\d\s([A-Z]{3})((?:\d{1,3}\.){3}\d{1,3}(:\d{1,5})?)\b(?!:\d)
See a regex demo.
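As a rough check of the pattern above, the following sketch tries it in Python (an assumed flavour) on a few suffixes from the question. The helper name ip_part is made up for illustration; it returns group 4 or None when nothing matches.

```python
import re

# The word-boundary pattern from this answer, split over two lines for readability.
first_suggestion = re.compile(
    r"\b(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\.\d\s([A-Z]{3})"
    r"((?:\d{1,3}\.){3}\d{1,3}(:\d{1,5})?)\b(?!:\d)"
)

def ip_part(line):
    """Return the IP (plus port) group, or None if the pattern does not match."""
    m = first_suggestion.search(line)
    return m.group(4) if m else None

prefix = "2021-05-20 11:03:00.0 GMT222.111.222.33"
suffixes = ["", ":2323", ":2323444 first second", ": first second"]
results = {suffix: ip_part(prefix + suffix) for suffix in suffixes}

for suffix, value in results.items():
    print(repr(suffix), "->", value)
```

In this sketch the over-long port is rejected, but a colon with no port at all (": first second") still matches, because the lookahead only forbids a colon followed by a digit. The updated question lists that case as one that should not match.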
EDIT
After the updated question, you might use:
^(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\.\d\s([A-Z]{3})((?:\d{1,3}\.){3}\d{1,3}(:\d{1,5})?)(?!\S)(.*)$
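A quick way to check this pattern against the lists in the question is the sketch below, again assuming Python's re module. The variable names are illustrative only; the test strings are copied from the question.

```python
import re

# The anchored pattern from the EDIT, split over two lines for readability.
anchored = re.compile(
    r"^(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\.\d\s([A-Z]{3})"
    r"((?:\d{1,3}\.){3}\d{1,3}(:\d{1,5})?)(?!\S)(.*)$"
)

prefix = "2021-05-20 11:03:00.0 GMT222.111.222.33"

should_match = [
    prefix,
    prefix + " first second",
    prefix + ":23232",
    prefix + ":23232 first second",
]
should_not_match = [
    prefix + ": first second",          # colon without a port
    prefix + ":2323444 first second",   # port longer than 5 digits
    prefix + "first second",            # no space before the text
    prefix + ":23232first second",      # no space after the port
]

accepted = [bool(anchored.match(s)) for s in should_match]
rejected = [anchored.match(s) is None for s in should_not_match]
print(accepted, rejected)

# Group 6 keeps the leading space, so strip it if only the text is wanted.
trailing = anchored.match(prefix + ":23232 first second").group(6).strip()
print(trailing)  # first second
```

The pattern itself breaks down as follows: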
- ^ Start of string
- (\d{4}-\d{2}-\d{2})\s Capture group 1, then match a whitespace char
- (\d{2}:\d{2}:\d{2}) Capture group 2
- \.\d\s Match a dot, a digit and a whitespace char
- ([A-Z]{3}) Capture group 3, match 3 uppercase chars
- ( Capture group 4
- (?:\d{1,3}\.){3}\d{1,3} Match an IP-like format
- (:\d{1,5})? Optional group 5, match : and 1-5 digits
- ) Close group 4
- (?!\S) Negative lookahead, assert a whitespace boundary to the right (the next char is not a non-whitespace char)
- (.*) Capture group 6, match any characters
- $ End of string
Upvotes: 2
Reputation: 6551
This is the fix:
^(\d{4}-\d{2}-\d{2})\s(\d{2}\:\d{2}\:\d{2})\.\d\s([A-Z]{3})((\d{1,3}\.){3}\d{1,3}((:\d{1,5})?))$
Example: https://regex101.com/r/oF7HI2/1
Upvotes: 2
Reputation:
I'm not sure I fully understand the problem but this could be the solution:
(\d{4}-\d{2}-\d{2})\s(\d{2}\:\d{2}\:\d{2})\.\d\s([A-Z]{3})((\d{1,3}\.){3}\d{1,3}(?:(:\d{1,})?))\s?(?>.*?)
Upvotes: 2