snh_nl

Reputation: 2955

Regex to split a string at the last occurrence of a dot, dash or underscore

We have thousands of rows of data containing article numbers in all sorts of formats, and I need to split off the main article number from a size indicator. There is (almost) always a dot, dash or underscore before the last few characters (not always 2 of them).

In short: the data is main article number + size indicator; the separator differs, but it is always one of the 3 characters .-_

Question: how do I split the main article number from the size indicator? The regex below, which I built after some Googling, isn't working.

preg_match('/^(.*)[\.-_]([^\.-_]+)$/', $sku, $matches);

Sample data + expected result

AR.110052.15-40 [AR.110052.15 & 40]
BI.533.41-41 [BI.533.41 & 41]
CG.00554.000-39 [CG.00554.000 & 39]
LL.PX00.SC004-40 [LL.PX00.SC004 & 40]
LOS.HAPPYSOCKS.1X [LOS.HAPPYSOCKS & 1X]
MI.PMNH300043-XXXXL [MI.PMNH300043 & XXXXL]

Upvotes: 1

Views: 1743

Answers (2)

mickmackusa

Reputation: 47854

Use preg_split() instead of preg_match() because:

  1. this isn't a validation task, it is an extraction task and
  2. preg_split() returns the exact desired array compared to preg_match() which carries the unnecessary fullstring match in its returned array.

Limit the number of elements produced (like you would with explode()'s limit parameter).

No capture groups are needed at all.

Greedily match zero or more characters, then, just before matching the last occurring delimiter, restart the fullstring match with \K. The match then consists of the delimiter alone, so the delimiter effectively becomes the character to explode on and it is "lost" in the explosion.

Code: (Demo)

$strings = [
    'AR.110052.15-40',
    'BI.533.41-41',
    'CG.00554.000-39',
    'LL.PX00.SC004-40',
    'LOS.HAPPYSOCKS.1X',
    'MI.PMNH300043-XXXXL',
];

foreach ($strings as $string) {
    var_export(preg_split('~.*\K[._-]~', $string, 2));
    echo "\n";
}

Output:

array (
  0 => 'AR.110052.15',
  1 => '40',
)
array (
  0 => 'BI.533.41',
  1 => '41',
)
array (
  0 => 'CG.00554.000',
  1 => '39',
)
array (
  0 => 'LL.PX00.SC004',
  1 => '40',
)
array (
  0 => 'LOS.HAPPYSOCKS',
  1 => '1X',
)
array (
  0 => 'MI.PMNH300043',
  1 => 'XXXXL',
)
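
Since the question says a delimiter is "(almost) always" present, here is a minimal sketch of one way to handle rows without one. The function name splitSku() and the choice of null for a missing size are my own assumptions, not part of the original task; the pattern is the same as above.

```php
<?php
// Hypothetical helper: always returns [main, size], with size set to null
// when the SKU contains no dot, underscore or hyphen.
function splitSku(string $sku): array
{
    // Greedy .* runs to the last delimiter; \K drops the consumed prefix
    // from the match, so only the delimiter itself is split on.
    $parts = preg_split('~.*\K[._-]~', $sku, 2);

    // Without a delimiter, preg_split() returns the whole string as the
    // only element; pad so callers can always destructure two values.
    return array_pad($parts, 2, null);
}

[$main, $size] = splitSku('MI.PMNH300043-XXXXL');  // normal case
[$mainOnly, $noSize] = splitSku('PLAINSKU');       // no delimiter present
```

Padding keeps array destructuring safe on the odd rows; if a missing size should be treated as an error instead, checking count($parts) before padding would be the place to do it.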

Upvotes: 0

Wiktor Stribiżew

Reputation: 626689

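You need to move the - to the end of the character class to make the regex engine parse it as a literal hyphen. In [\.-_], the sequence \.-_ is read as a range from . (0x2E) to _ (0x5F), which covers digits and uppercase letters but not the hyphen (0x2D) itself: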

^(.*)[._-]([^._-]+)$

See the regex demo. Actually, even ^(.+)[._-](.+)$ will work.

  • ^ - matches the start of string
  • (.*) - Group 1 capturing any 0+ chars as many as possible up to the last...
  • [._-] - either . or _ or -
  • ([^._-]+) - Group 2: one or more chars other than ., _ and -
  • $ - end of string.
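
As a rough sketch of how the corrected pattern could be dropped into the preg_match() call from the question (the variable names $expected, $results and $brokenMatches are only illustrative, and three of the question's samples are used):

```php
<?php
// Sample SKUs from the question, mapped to the expected [main, size] pairs.
$expected = [
    'AR.110052.15-40'     => ['AR.110052.15', '40'],
    'LOS.HAPPYSOCKS.1X'   => ['LOS.HAPPYSOCKS', '1X'],
    'MI.PMNH300043-XXXXL' => ['MI.PMNH300043', 'XXXXL'],
];

$results = [];
foreach ($expected as $sku => $pair) {
    // Hyphen placed last in the class, so it is a literal, not a range operator.
    if (preg_match('/^(.*)[._-]([^._-]+)$/', $sku, $m)) {
        // $m[0] is the full match; groups 1 and 2 hold the two parts.
        $results[$sku] = [$m[1], $m[2]];
    }
}

// For comparison, the pattern from the question: "0" falls inside the
// accidental "." to "_" range, so the negated class cannot match the tail
// of the string and preg_match() finds nothing.
$brokenMatches = preg_match('/^(.*)[\.-_]([^\.-_]+)$/', 'AR.110052.15-40');
```

Here $results ends up equal to $expected, while $brokenMatches is 0.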

Upvotes: 2
