user2402616

Reputation: 1563

Regex to get all numbers after a character

I have strings that are expected to be in the format of something like

"C 1,13,7,2,55" I would expect matches to be [1,13,7,2,55].

I want to match all numbers in that "csv" portion of the string, but only if it comes after "C " (note the space after the 'C').

This comes from user input, so I want to account for case, multiple spaces between tokens, accidental double commas, etc.

I.e. "c 1 , 12,15 , 8 , 9,10,11" I want matches to be [1,12,15,8,9,10,11]

But I only want to attempt to match on numbers after the "C" char (case-insensitive).

So "1,2 , 4,5" and "d 12456, 9890" should fail.

Here's the regex I have half-baked so far.

Note: This will ultimately get ported over to PHP and so I will be using preg_match_all

/(?<=C)*\d+/gim

I use a positive lookbehind (but match as many times as needed) for the "C" char. Then match on 1 or more digits globally.

I haven't created all my unit tests yet, but I think this may work.

Is there a better way to do this? Is matching on 1 or more positive lookbehinds standard?

Why don't I need to include a \s* after the 'C' in the positive lookbehind? When would including the 'm' multi-line flag even make a difference here?

Thanks!

Upvotes: 1

Views: 218

Answers (2)

The fourth bird

Reputation: 163217

The pattern /(?<=C)*\d+/gim would not be valid in, for example, JavaScript, because of the quantifier after the lookbehind assertion.

If you want to write it in JavaScript, where a quantifier inside the lookbehind is supported, you can get all the digits after a C at the start of the string with:

(?<=^C [\d, ]*)\d+


In PHP, the * quantifier in (?<=C)*\d+ makes the lookbehind optional, so the pattern would also match 8 and 9 in, for example, the string 8,9 C 1,13,7,2,55
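
To see the optional lookbehind in action, here is a minimal sketch that runs the question's pattern (without the JavaScript-only g flag) against that string. The variable names are only for illustration.

```php
<?php
// The * makes the lookbehind optional, so no C is required before the digits.
$subject = '8,9 C 1,13,7,2,55';

preg_match_all('/(?<=C)*\d+/i', $subject, $matches);

// 8 and 9 are included even though they come before the C.
print_r($matches[0]);
```

Every run of digits is returned, including the 8 and 9 in front of the C, so the lookbehind is not filtering anything.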

Using a quantifier with unbounded length in a lookbehind assertion is not supported in PHP, so you cannot use (?<=C\h+)\d+, where \h+ would match 1 or more horizontal whitespace characters.
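
A quick way to confirm this is to try compiling the pattern. The sketch below assumes you only care about the return value, so the warning is suppressed with @.

```php
<?php
// @ suppresses the compile warning so only the return value is inspected.
$result = @preg_match_all('/(?<=C\h+)\d+/i', 'C 1,13,7,2,55', $matches);

// preg_match_all() returns false when the pattern cannot be compiled.
var_dump($result);
```

Without the @ you would also see a warning saying that the pattern failed to compile. The exact wording depends on the PCRE version bundled with your PHP build.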


If you are using PHP, you can make use of the \G anchor to match only consecutive numbers after the first C character.

For a single line, you don't need the multiline flag. You do need it for multi-line input, because of the ^ anchor.

(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+

The pattern matches:

  • (?: Non capture group
    • ^ Start of string
    • \h*C\h+ Match optional spaces, then C and 1+ spaces
    • | Or
    • \G(?!^) Assert the position at the end of the previous match (not at the start)
  • ) Close the non capture group
  • \h*,*\h*\K Match optional commas between optional spaces, then \K resets the start of the reported match
  • \d+ Match 1 or more digits


$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/i';
$strings = [
    "C 1,13,7,2,55",
    "c    1   ,  12,15     ,   8     ,   9,10,11",
    "1,2  ,  4,5",
    "d 12456, 9890"
];

foreach ($strings as $s) {
    if (preg_match_all($regex, $s, $matches)) {
        print_r($matches[0]);
    }
}

Output

Array
(
    [0] => 1
    [1] => 13
    [2] => 7
    [3] => 2
    [4] => 55
)
Array
(
    [0] => 1
    [1] => 12
    [2] => 15
    [3] => 8
    [4] => 9
    [5] => 10
    [6] => 11
)
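
For multi-line input, the same pattern can be used with the m flag added, as mentioned above. This is a sketch with a made-up three-line input, where only the first and third lines start with C:

```php
<?php
// Same pattern, with the m flag added so ^ matches at the start of every line.
$regex = '/(?:^\h*C\h+|\G(?!^))\h*,*\h*\K\d+/im';

// Hypothetical input: the middle line starts with "d" and should be skipped.
$input = "C 1,2\nd 3,4\nc 5 , 6";

preg_match_all($regex, $input, $matches);

// Only the numbers from the lines that start with C are collected.
print_r($matches[0]);
```

Because \h does not match a newline, the \G branch cannot carry a run of matches over to the next line, so the 3 and 4 on the "d" line are not picked up.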

Upvotes: 1

MikeM

Reputation: 13631

The simplest option is probably to first test for the "C" using the case-insensitive stripos before matching the digits with \d+. For example:

$input = "c    1   ,  12,15     ,   8     ,   9,10,11";

if (stripos($input, "C") === 0) {
    preg_match_all("/\d+/", $input, $matches);
    print_r($matches);
}

The condition could instead be, for example, stripos($input, "C") !== false if the "C" does not have to be the first character.

To validate that the string starts with "C" (possibly after horizontal whitespace), and contains only horizontal whitespace, commas and digits then the test could instead be

if (preg_match("/^\h*C[\h\d,]+$/i", $input)) { 
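
Putting the validation and the extraction together could look something like the sketch below. The function name extractNumbers and the choice to return null for rejected input are assumptions for illustration, not part of the question.

```php
<?php
// Hypothetical helper: returns the numbers as ints, or null if the input is rejected.
function extractNumbers(string $input): ?array
{
    // Must start with C (any case), followed only by horizontal whitespace, digits and commas.
    if (!preg_match('/^\h*C[\h\d,]+$/i', $input)) {
        return null;
    }

    // The input is known to be well-formed, so simply collect every run of digits.
    preg_match_all('/\d+/', $input, $matches);

    return array_map('intval', $matches[0]);
}

var_dump(extractNumbers("c    1   ,  12,15     ,   8     ,   9,10,11"));
var_dump(extractNumbers("d 12456, 9890"));
```

Note that this test also accepts input such as "C1,2" with no space after the C. If the space is required, the pattern would need a \h+ directly after the C.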

The lookbehind in your regex /(?<=C)*\d+/gim is made optional by the *, so the regex does not require that a "C" is present for the digits to be matched. It is functionally equivalent to just /\d+/g.

Is matching on 1 or more positive lookbehinds standard?

In this case, the lookbehind would need to be variable-width, (?<=C.*), and PHP does not support variable-width lookbehinds.

Why don't I need to include a \s* after the 'C' in the positive lookbehind?

PHP does not support the use of the * quantifier inside a lookbehind.

When would including the 'm' multi-line flag even make a difference here?

You might only want to use the m flag if your input is multi-line and you are using ^ or $. With m, they assert the start or end of each line; without it, they assert the start or end of the whole string.
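
As a rough illustration of the difference, here is the validation pattern from earlier in this answer run with and without m on a made-up three-line input:

```php
<?php
// Hypothetical three-line input; only lines 1 and 3 start with C.
$input = "C 1,2\nd 3,4\nc 5 , 6";

// Without m, ^ and $ anchor to the whole string, so nothing matches.
$withoutM = preg_match_all('/^\h*C[\h\d,]+$/i', $input, $a);

// With m, ^ and $ anchor to each line, so each valid line is matched on its own.
$withM = preg_match_all('/^\h*C[\h\d,]+$/im', $input, $b);

var_dump($withoutM, $withM);
print_r($b[0]);
```

If you only ever validate one line at a time, the flag makes no difference.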

Upvotes: 3
