Rafael

Reputation: 203

Grouping in regular expressions

I want to explore regular expressions a bit more, following up on "Add a space on a string but counting right to left".

The result of this regex

preg_replace("/(?=(.{3})*(.{4})$)/", "-", "1231231234");

is: 123-123-1234

Now I am experimenting with quantifiers and groups, but I cannot get them to work properly.

Why do this (PHP)

preg_replace("/(?=(.{3})*(.{4})(.{4})$)/", "-", "1212312312345678");

and this:

preg_replace("/(?=(.{3})*(.{4}){2}$)/", "-", "1212312312345678");

both give me one big 8-character group in the output?

12-123-123-12345678

I would probably have expected that result in the second case ({2}), but not in the first.

The expected result I intended was:

12-123-123-1234-5678

1) What is the logic behind (.{4})(.{4}) behaving like (.{8}) instead of being 2 different events?

2) What would be the proper grouping?

Upvotes: 0

Views: 89

Answers (3)

Aran-Fey

Reputation: 43136

You seem to misunderstand how that regex works. Let me break it down for you:

(?=          lookahead assertion: the following pattern must match, but
             will not consume any of the text.
   (.{3})*   matches a series of 3 characters, any number of times. In
             other words, this consumes characters in multiples of 3.
   (.{4})$   makes sure there are exactly 4 characters left.
)

This pattern produces an empty match in every place where you want to insert a dash -. That's why preg_replace("/(?=(.{3})*(.{4})$)/", "-", "1231231234"); inserts dashes in the correct places - replacing the empty string is the same as inserting. Let's look at that step-by-step, using the text 31231234 as an example:

         remaining text     remaining pattern      what happens
step 0:  31231234           (.{3})*(.{4})$         (.{3})* matches one time
step 1:  31234              (.{3})*(.{4})$         (.{3})* matches again
step 2:  34                 (.{3})*(.{4})$         (.{3})* fails to match another time
step 3:  34                 (.{4})$                (.{4}) fails to match -> backtrack
step 4:  31234              (.{4})$                (.{4}) matches, but $ fails -> backtrack
step 5:  31231234           (.{4})$                (.{4}) matches, but $ fails -> pattern
                                                   failed to match, no dash will be inserted.

After the pattern failed to match at position 0 in the text, it will be checked again at position 1 (remaining text is 1231234):

         remaining text     remaining pattern      what happens
step 0:  1231234            (.{3})*(.{4})$         (.{3})* matches one time
step 1:  1234               (.{3})*(.{4})$         (.{3})* matches again
step 2:  4                  (.{3})*(.{4})$         (.{3})* fails to match another time
step 3:  4                  (.{4})$                (.{4}) fails to match -> backtrack
step 4:  1234               (.{4})$                (.{4})$ matches -> dash will be inserted
                                                   here, giving "3-1231234"

The same thing happens again 3 characters later, giving the end result 3-123-1234. In other words, the group (.{4})$ specifies that no dashes should be inserted in the last 4 characters of the text. By consuming the last 4 characters, it makes it impossible for the pattern to match if there are fewer than 4 characters remaining. That is why both (.{4})(.{4})$ and (.{4}){2}$ produce a block of 8 characters: the pattern cannot match if fewer than 8 characters remain.
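If you want to check this yourself, you can list the offsets where the lookahead succeeds. This is only a rough sketch: the helper name dash_positions is made up for illustration, and it relies on preg_match_all with PREG_OFFSET_CAPTURE reporting each empty match together with its offset.

```php
<?php
// Hypothetical helper (not part of the question): returns the offsets at
// which a zero-width pattern matches, i.e. where a dash would be inserted.
function dash_positions(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
    // With PREG_OFFSET_CAPTURE, $matches[0] is a list of [match, offset] pairs.
    return array_column($matches[0], 1);
}

// One trailing group: a match needs 4, 7, 10, ... characters remaining.
$short = dash_positions('/(?=(.{3})*(.{4})$)/', '31231234');

// Two trailing groups: a match needs 8, 11, 14, ... characters remaining.
$long = dash_positions('/(?=(.{3})*(.{4}){2}$)/', '1212312312345678');

echo implode(',', $short), PHP_EOL; // 1,4
echo implode(',', $long), PHP_EOL;  // 2,5,8
```

By the same rule, a text whose own length is 4, 7, 10, ... also matches at offset 0, so a guard such as (?!^) in front of the lookahead is one way to keep a dash from appearing at the very start.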

To insert another dash within the last 8 characters, you have to use two groups of 4 characters (.{4}) and make the first one, together with the preceding (.{3})*, optional:

(?=((.{3})*.{4})?(.{4})$)
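Here is a quick check of that pattern against the string from the question. With the optional part skipped it matches when exactly 4 characters remain; with the optional part present it matches when 8, 11, 14, ... characters remain. The variable names are just for illustration.

```php
<?php
// Pattern from this answer: everything before the final 4 characters is optional.
$pattern = '/(?=((.{3})*.{4})?(.{4})$)/';
$text = '1212312312345678';

// Each zero-width match is replaced by a dash, which inserts it at that offset.
$result = preg_replace($pattern, '-', $text);

echo $result, PHP_EOL; // 12-123-123-1234-5678
```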

Upvotes: 1

Ruslan Osmanov

Reputation: 21492

(?=(.{3})*(.{4}){2}$) matches at every position that is followed by N groups of 3 characters and then 2x4 = 8 characters up to the end, where N >= 0.

To match at every position that is 4xN characters from the end, where 1 <= N <= 2, or that is followed by N groups of 3 characters and then 8 characters up to the end, where N >= 1, use the following:

preg_replace("/(?=(.{4}){1,2}$)|(?=(.{3})+.{8}$)/", "-", "1212312312345678");
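To see what each alternative contributes, you can run the two lookaheads separately and list their match offsets. This is a sketch only: the $offsets closure is a made-up helper, not something from the question.

```php
<?php
$text = '1212312312345678';

// Hypothetical helper for illustration: offsets where a zero-width pattern matches.
$offsets = function (string $pattern) use ($text): array {
    preg_match_all($pattern, $text, $m, PREG_OFFSET_CAPTURE);
    return array_column($m[0], 1);
};

// First alternative: exactly 4 or 8 characters remain.
$tail = $offsets('/(?=(.{4}){1,2}$)/');
// Second alternative: 8 characters plus at least one group of 3 remain.
$head = $offsets('/(?=(.{3})+.{8}$)/');

echo implode(',', $tail), PHP_EOL; // 8,12
echo implode(',', $head), PHP_EOL; // 2,5

$result = preg_replace('/(?=(.{4}){1,2}$)|(?=(.{3})+.{8}$)/', '-', $text);
echo $result, PHP_EOL; // 12-123-123-1234-5678
```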

Upvotes: 1

Sebastian Proske

Reputation: 8413

Note that you are using lookaheads in this case. Unlike normal matching, they don't actually consume what they match.

So in the first example there are 2 zero-width matches. The first one is after the first 123, where the lookahead matches 1231234; the second is after the second 123, where the lookahead matches 1234. You might want to use one of the online regex testers to see what actually matches; my choice would be regex101.com.

So for your example you have to make the lookahead also match the last 4 digits (and only them). One way to achieve this would be (?=((.{3})*(.{4}))?(.{4})$), which makes the first part optional.

See it here on regex101.
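If the nested groups are hard to read, the same pattern can be written with the x (extended) modifier, which ignores whitespace in the pattern and allows # comments. The layout and comments below are my own annotation, assuming nothing beyond the pattern given above.

```php
<?php
// Same pattern as above, spread out with the x modifier for readability.
$pattern = '/
    (?=                 # lookahead: assert without consuming
        (               # optional front part:
            (.{3})*     #   any number of 3-character groups
            (.{4})      #   followed by one 4-character group
        )?
        (.{4})$         # the final 4 characters, anchored at the end
    )
/x';

$text = '1212312312345678';
$result = preg_replace($pattern, '-', $text);

echo $result, PHP_EOL; // 12-123-123-1234-5678
```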

Upvotes: 2
