Alex

Reputation: 3571

Using regex to check comma usage

How can I write a regular expression that spots incorrect usage of a comma in a string, i.e.: 1. for non-numbers, no space before and 1 space after; 2. for numbers, commas are allowed only if preceded by 1-3 digits and followed by 3 digits.

Some test cases:

So I thought I'd have a regex to capture words with bad syntax via (?![\S\D],[\S\D]) (capture where there's a non-space/digit followed by a comma, followed by a non-space/digit), and join that with another regex to capture numbers with bad syntax, via (?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$)). Putting that together gets me

preg_match_all("/(?![\S\D],[\S\D])|(?!(.*?^(?:\d+|\d{1,3}(?:,\d{3})*)(?:\.\d+)?$))/",$str,$syntax_result);

... but obviously it doesn't work. How should it be done?

================EDIT================

Thanks to Casimir et Hippolyte's answer below, I got it to work! I've updated his answer to take care of more corner cases. Idk if the syntax I added is the most efficient, but it works, for now. I'll update this as more corner cases come up!

$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    [\w\)]+,((?=[ ][\w\s\(\"]+)|(?=[\s]+))  # comma between words or line break
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;
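
To see what the extra alternatives in the edited pattern cover, here is a minimal sketch that runs it on two corner cases it appears to target: a closing parenthesis right before the comma, and a comma at the end of a line. The helper name findBadCommas() and the sample strings are made up for illustration; they are not the original test cases.

```php
<?php
// Edited pattern copied from the post above.
$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    [\w\)]+,((?=[ ][\w\s\(\"]+)|(?=[\s]+))  # comma between words or line break
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;

// Hypothetical helper: byte offsets of the commas the pattern reports as misplaced.
function findBadCommas(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    // Each entry of $matches[0] is [matched text, offset]; keep the offsets only.
    return array_column($matches[0], 1);
}

$samples = [
    '(see above), then',  // ")" before the comma
    "first,\nsecond",     // comma directly followed by a line break
    'Hello,world',        // still wrong: no space after the comma
];

foreach ($samples as $sample) {
    // json_encode() keeps the line break visible as \n in the output.
    printf("%s => [%s]\n", json_encode($sample), implode(', ', findBadCommas($pattern, $sample)));
}
```

With these inputs, the first two strings should produce no matches, and the third should report the comma at offset 5.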

Upvotes: 6

Views: 1681

Answers (1)

Casimir et Hippolyte

Reputation: 89547

It isn't bulletproof, but this can give you an idea of how to proceed:

$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    \w+,(?=[ ]\w+)  # comma between words
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;

preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

print_r($matches[0]);

The idea is to exclude allowed commas from the match result so that only incorrect commas are returned. The first non-capturing group contains a kind of whitelist of correct situations. (You can easily add other cases.)
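
As a rough illustration of that idea, the sketch below runs the pattern from this answer over a few sample strings and prints the offsets of the commas it reports. The helper name findBadCommas() and the sample strings are assumptions made for this example, not part of the original question.

```php
<?php
// Pattern copied from the answer above.
$pattern = <<<'LOD'
~
(?: # this group contains allowed commas
    \w+,(?=[ ]\w+)  # comma between words
  |
    (?<=^|[^\PP,]|[£$\s]) [0-9]{1,3}(?:,[0-9]{3})* (?=[€\s]|[^\PP,]|$) # thousands separator
) (*SKIP) (*FAIL) # make the pattern fail and forbid backtracking
| , # other commas
~mx
LOD;

// Hypothetical helper: byte offsets of the commas the pattern reports as misplaced.
function findBadCommas(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    // Each entry of $matches[0] is [matched text, offset]; keep the offsets only.
    return array_column($matches[0], 1);
}

$samples = [
    'Hello, world',           // correct: no space before, one space after
    'Hello ,world',           // space before the comma, none after
    'Hello,world',            // missing space after the comma
    'It costs 1,000 dollars', // valid thousands separator
    '1,00 apples',            // only two digits after the comma
];

foreach ($samples as $sample) {
    printf("%-24s => [%s]\n", $sample, implode(', ', findBadCommas($pattern, $sample)));
}
```

Allowed commas are consumed by the first group and then discarded by (*SKIP) (*FAIL), so only the second, third and fifth samples should yield an offset (6, 5 and 1 respectively).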

[^\PP,] means "all punctuation characters except ,", but you can replace this character class with a more explicit list of allowed characters, for example: [("']
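
A quick way to check what [^\PP,] accepts is to test single characters against it. This is only a sketch: isAllowedBoundary() is a hypothetical helper name, and it assumes a PCRE build with Unicode property support, which is what PHP normally ships with.

```php
<?php
// Hypothetical helper: does a single character match the class [^\PP,] ?
function isAllowedBoundary(string $char): bool
{
    // \PP is "not punctuation", so the negated class keeps punctuation minus the comma.
    return preg_match('/^[^\PP,]$/', $char) === 1;
}

foreach (['.', '(', '"', ',', '$', 'a', ' '] as $char) {
    printf("%s => %s\n", var_export($char, true), isAllowedBoundary($char) ? 'match' : 'no match');
}
```

Note that "$" is classified as a currency symbol rather than punctuation, so it does not match; that is presumably why the pattern lists currency signs in a separate character class.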

You can find more information about (*SKIP) and (*FAIL) here and here.

Upvotes: 3
