Reputation: 61

PCRE php regex to match groups correctly

I have the following sample text:

tabela de Preço 18654 TONER XEROX 106R01632 MA(6000/6010 117.90 129.90 18656 TONER XEROX 106R01634 PR 6000/6010 179.00 199.00 UDP COMPUT ADORES IBYTE 32607 UDP A - GCL(CDCP 2.41,2,500) 747.00 829.90 32148 UDP A - GCL(CDCP 2.41,2,500) 747.00 829.90 32149 UDP A - GCL(CDCP 2.41,4,500,DVD) 769.90 879.00 32555 UDP A - GCL(CDCP 2.41,4,500,DVD) 769.90 879.00 32490 UDP A - ICL(CDCP 2.41,2,500) 747.00 829.90 32150 UDP A - ICL(CDCP 2.41,2,500) 747.00 829.90 32024 UDP A - ICW10(CDC 2.8,4,500,DVD) 1 260.001 399.90 32445 UDP A - ICW10(CDC 2.8,4,500,DVD) 1 260.001 399.90 31060 UDP A - ISW10PRO(CDCP 2.41,4,500)SLI1 349.901 549.90 32356 UDP F - GCL(I3 6G 3.7,4,500,DVD,LT) 1 699.001 929.90

and I have to match it in groups like:

code, description,value1,value2

using that excerpt as a source:

"18654 TONER XEROX 106R01632 MA(6000/6010 117.90 129.90"

It's a product, and I need to parse it as follows:

"18654" is the code
"TONER XEROX 106R01632 MA(6000/6010" is the description
"117.90" is the value1
"129.90" is the value2

However, the lengths of the description, value1 and value2 vary: some products have a value1 like "117.90", while others have "1 699.00" or "90.00".

I'm trying the following regex to capture the groups, but it matches only some of the products correctly, not the whole source string:

(?<code>\d{5})\s{1}(?<description>.{20,35})\s{1}(?<value1>\d{2,3}\.\d{2})\s{1}(?<value2>\d{2,3}\.\d{2})

How do I capture the groups correctly for each product in this sample source string using PCRE (PHP)?

Here is a regex101.com URL that shows what I have tried: https://regex101.com/r/Smh2KA/3

Thanks in advance.

Upvotes: 0

Answers (3)

Stephane Janicaud

Reputation: 3627

This one should work:

(?<code>\d{5})\s+(?<description>((?!\d{2,}\.\d{1,}).)*)\s+(?<value1>\d{2,3}\.\d{1,})((?!\d{2,}\.\d{1,}).)*(?<value2>\d{2,}\.\d{1,})

Here is a demo based on your initial text, and here is a simpler one.

It returns 35 matches as expected, including this one, which was a little tricky because value1 and value2 were not separated by a simple space:

31069 UDP GAMER - IGW10(I7 3.4,8,1,DVD,PV)4 499.0 04 999.90

Upvotes: 0

Casimir et Hippolyte

Reputation: 89557

You can use this pattern:

$pattern = '~\b (?<id>\d{5}) \s
           (?<desc>.*?) \s*+
           (?<val1>
               (?: \d \s*(?=[\d\s]*\.\d\s?\d\s*(?<c>(?(c)\g{c})\s*\d)) )+
               \.\d\s?\d
           ) \s*
           (?<val2>\g{c}\d?\.\d{2})~x';

demo

The subpattern in val1 checks that, for each digit in the integer part of val1, there is also a digit in the integer part of val2. That's why this part is a bit complicated. The advantage is that the description part and the first value can no longer be confused.

val1 subpattern details:

(?:
    \d \s* # 1 digit in val1 (and an optional space)
    (?= # lookahead that checks if for this digit there's also
        # a digit in val2
        [\d\s]*\.\d\s?\d\s* # reach val2
        (?<c> # open a capture group c
             (?(c)\g{c}) # conditional: if the capture group c has already captured
                         # something then start the group with the backreference \g{c}
                         # (this means that the non-capturing group has been repeated
                         # at least once)
             \s*\d       # add the next digit to c
        )
    )
)+ # repeat the non-capturing group
\.\d\s?\d

Note that this pattern needs a lot of steps to succeed. If you need to use it on a large input, I suggest splitting the string before each code and then searching each part with preg_match and the previous pattern (you can start it with the ^ anchor instead of \b):

$parts = preg_split('~\b(?=\d{5}\b)~', $str);
$result = [];
foreach ($parts as $part) {
    // Skip chunks without a product (e.g. the text before the first code).
    if (preg_match($pattern, $part, $m)) {
        $result[] = [$m['id'], $m['desc'], $m['val1'], $m['val2']];
    }
}
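
To see what the split step yields on its own, here is a minimal sketch. The $str value is a shortened excerpt of the question's text chosen for illustration (an assumption, not the full input); only the preg_split call comes from the answer above.

```php
<?php
// Shortened sample: some header text followed by two products (assumed input).
$str = 'tabela 18654 TONER XEROX 106R01632 MA(6000/6010 117.90 129.90 '
     . '18656 TONER XEROX 106R01634 PR 6000/6010 179.00 199.00';

// Zero-width split: cut right before every standalone 5-digit code.
$parts = preg_split('~\b(?=\d{5}\b)~', $str);

// Each chunk now holds at most one product; the first chunk is the header text.
foreach ($parts as $i => $part) {
    echo $i, ': [', $part, ']', PHP_EOL;
}
```

Because the split is zero-width, no characters are lost: joining the chunks gives back the original string. Numbers such as 106R01632 or 6000/6010 do not trigger a split, since they do not contain a standalone run of exactly five digits.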

Upvotes: 1

Wiktor Stribiżew

Reputation: 626845

I suggest a regex like

\b(?<code>\d{5})\s+(?<description>.*?)\s+(?<value1>\d[,\d\s]*\.\d{2})\s*(?<value2>\d[,\d\s]*\.\d{2})

See the regex demo

A version with comments:

\b                           # leading word boundary
(?<code>\d{5})               # 5 digits
\s+                          # 1+ whitespaces
(?<description>.*?)          # any 0+ non-line break chars, as few as possible
\s+                          # 1+ whitespaces
(?<value1>\d[,\d\s]*\.\d{2}) # a float number with 2-digit fractional part
\s*                          # 0+ whitespaces
(?<value2>\d[,\d\s]*\.\d{2}) # a float number

NOTE: If your float values (value1 and value2) contain , as a thousand separator and . as a decimal separator, adjust their patterns to \d[,\d]*\.\d+. If the thousand separator is a space, use \d[\d\s]*\.\d+. If the thousand separator is a space and the decimal separator is a comma, use \d[\d\s]*,\d+. And so on and so forth.
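
As a rough illustration of how this pattern could be used from PHP, here is a minimal sketch with preg_match_all. The ~ delimiters, the $text excerpt (three products copied from the question) and the $products array layout are choices made for this example, not part of the answer itself.

```php
<?php
// Three products copied from the question's sample text (assumed excerpt).
$text = '18654 TONER XEROX 106R01632 MA(6000/6010 117.90 129.90 '
      . '32607 UDP A - GCL(CDCP 2.41,2,500) 747.00 829.90 '
      . '32024 UDP A - ICW10(CDC 2.8,4,500,DVD) 1 260.001 399.90';

// The pattern from above, wrapped in ~ delimiters.
$pattern = '~\b(?<code>\d{5})\s+(?<description>.*?)\s+'
         . '(?<value1>\d[,\d\s]*\.\d{2})\s*(?<value2>\d[,\d\s]*\.\d{2})~';

// PREG_SET_ORDER gives one array per product, keyed by group name.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$products = [];
foreach ($matches as $m) {
    $products[] = [
        'code'        => $m['code'],
        'description' => $m['description'],
        'value1'      => $m['value1'],
        'value2'      => $m['value2'],
    ];
}

print_r($products);
```

On this excerpt the third product should come out with value1 "1 260.00" and value2 "1 399.90", even though the two values are glued together in the source. The "2.41" inside the second description is not taken as value1, because no second number follows it directly.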

Upvotes: 1
