Reputation: 3571
How can I write a regular expression that spots incorrect usage of a comma in a string, i.e.: 1. for non-numbers, no space before the comma and exactly one space after it; 2. for numbers, a comma is allowed only if it is preceded by 1-3 digits and followed by 3 digits.
Some test cases:
So I thought I'd use one regex to capture words with bad syntax, (?![\S\D],[\S\D])
(capture where a non-space/non-digit is followed by a comma and then by another non-space/non-digit), and join that with another regex to capture numbers with bad syntax, (?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$))
. Putting that together gets me
preg_match_all("/(?![\S\D],[\S\D])|(?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$))/",$str,$syntax_result);
...but obviously it doesn't work. How should it be done?
================EDIT================
Thanks to Casimir et Hippolyte's answer below, I got it to work! I've extended the pattern from his answer to handle more corner cases. I don't know whether the syntax I added is the most efficient, but it works for now. I'll update this as more corner cases come up!
$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
[\w\)]+,((?=[ ][\w\s\(\"]+)|(?=[\s]+)) # comma between words or line break
|
(?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;
Upvotes: 6
Views: 1681
Reputation: 89547
It isn't watertight, but this can give you an idea of how to proceed:
$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
\w+,(?=[ ]\w+) # comma between words
|
(?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;
preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
print_r($matches[0]);
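As a rough illustration of what this returns, here is a minimal sketch that runs the same pattern over a few sample strings and collects the byte offsets of the commas it flags. The sample strings and the helper name badCommaOffsets() are assumptions made for this example and are not part of the original answer.

```php
<?php
// Pattern copied verbatim from the answer above.
$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
\w+,(?=[ ]\w+) # comma between words
|
(?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;

// Hypothetical helper: returns the byte offsets of commas the pattern flags as incorrect.
function badCommaOffsets(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    // Each entry of $matches[0] is [matched text, offset]; keep only the offsets.
    return array_column($matches[0], 1);
}

$samples = [
    "Hello, world",           // correct: no space before, one space after
    "Hello ,world",           // space before the comma, none after
    "Hello,world",            // missing space after the comma
    "It costs 1,000 dollars", // valid thousands separator
    "It costs 1,00 dollars",  // only two digits after the comma
];

foreach ($samples as $text) {
    echo $text, ' => ', json_encode(badCommaOffsets($pattern, $text)), "\n";
}
// Hello, world => []
// Hello ,world => [6]
// Hello,world => [5]
// It costs 1,000 dollars => []
// It costs 1,00 dollars => [10]
```

Allowed commas are consumed by the first group and then discarded by (*SKIP) (*FAIL), so only the remaining commas reach the last alternative and show up in the result.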
The idea is to exclude allowed commas from the match result so that only incorrect commas are returned. The first non-capturing group acts as a list of correct situations to skip (you can easily add other cases).
[^\PP,] means "all punctuation characters except ,", but you can replace this character class with a more explicit list of allowed characters, for example: [("']
You can find more information about (*SKIP) and (*FAIL) in the PCRE documentation on backtracking control verbs.
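As a quick check of the double negative in [^\PP,] described above, the following sketch tests a handful of single characters against that class. The helper name isNonCommaPunctuation() and the sample characters are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: true when $ch is a single punctuation character other than a comma.
function isNonCommaPunctuation(string $ch): bool
{
    // \PP means "not punctuation"; negating it inside [^...] leaves punctuation only,
    // and listing the comma in the same negated class excludes it as well.
    return preg_match('/^[^\PP,]$/u', $ch) === 1;
}

foreach (['(', '"', '.', ',', 'a', '$'] as $ch) {
    printf("%s => %s\n", $ch, isNonCommaPunctuation($ch) ? 'match' : 'no match');
}
// ( => match
// " => match
// . => match
// , => no match
// a => no match
// $ => no match
```

Note that $ is classified by Unicode as a currency symbol rather than punctuation, which is presumably why the pattern lists it separately in [£$\s].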
Upvotes: 3