Renier
Renier

Reputation: 1535

How to group provided string correctly?

I have the following regex:

^([A-Za-z]{2,3}\d{6}|\d{5}|\d{3})((\d{3})?)(\d{2}|\d{3}|\d{6})(\d{2}|\d{3})$

I use this regex to match different, yet similar strings:

# MOR644-004-007-001
MOR644004007001 # string provided
# VUF00101-050-08-01
VUF001010500801 # string provided
# MF001317-077944-01
MF00131707794401 # string provided

These strings need to match/group as it is at the top of the strings, however my problem is that it is not grouping it correctly

The first string: MOR644004007001 is grouped: (MOR644004) (007) (001) which should be (MOR644) (004) (007) (001)

The second string: VUF001010500801 is grouped (VUF001010) (500) (801) which should be (VUF00101) (050) (08) (01)

How can I change ([A-Za-z]{2,3}\d{6}|\d{5}|\d{3})((\d{3})?) so that it would group the provided string correctly?

Upvotes: 1

Views: 70

Answers (3)

hypersonics
hypersonics

Reputation: 1134

Take a look at this. I have used what's called as a named group. As pointed out earlier by others, it's better to have one regex code for each string. I have shown here for the first string, MOR644004007001. Easily you can expand for other two strings:

import re

# MOR644-004-007-001
MOR = "MOR644004007001" # string provided
# VUF00101-050-08-01
VUF = "VUF001010500801" # string provided
# MF001317-077944-01
MF = "MF00131707794401" # string provided

MORcompile = re.compile(r'(?P<first>\w{,6})(?P<second>\d{,3})(?P<third>\d{,3})(?P<fourth>\d{,3})')
MORsearch = MORcompile.search(MOR.strip())
print MORsearch.group('first')
print MORsearch.group('second')
print MORsearch.group('third')
print MORsearch.group('fourth')

MOR644
004
007
001

Upvotes: 0

SamWhan
SamWhan

Reputation: 8332

Considering your comment to Serv I'd say the (only?) solution is to have one regex for each possibility, like -

MOR(\d{3})(\d{3})(\d{3})(\d{3})|VUF(\d{5})(\d{3})(\d{2})(\d{2})|MF(\d{6})(\d{6})(\d{2})

and then use the execution environment (JS/php/python - you haven't provided which one) to piece the parts together.

See example on regex101 here. Note that substitution, only as an example, matches only the second string.

Regards

Upvotes: 1

Marco Virgolin
Marco Virgolin

Reputation: 165

I am not sure that you can do what you want to. Let's consider the first two strings:

# MOR644-004-007-001
MOR644004007001 # string provided
# VUF00101-050-08-01
VUF001010500801 # string provided

Now, both the strings are composed of 3 chars followed by 12 digits. Thus, given a regex R, if R does not depend on particular (sequences of) characters and on particular (sequences of) digits (i.e., it presents [A-Za-z] and \d but does not present, let's say, MO and 0070), then it will match both the string in the same way.

So, if you want to operate a different matching, then you need to look at the particular occurrence of certain characters or digits. We need more data from you in order to give you an aswer.

Finally, I suggest you to take a look at this tool: http://regex.inginf.units.it/ (demo: http://regex.inginf.units.it/demo.html). It is a research project that automatically generates a regex given (many) examples of extraction. I warmly suggest you to try it, especially if you know that an underlying pattern is present in your case for sure (i.e. strings beginning with VUF must be matched differently from strings beginning with MOR) but you are unable to find it. Again, you will need to provide many examples to the engine. Needles to say, if a generic pattern does not exist, then the tool won't find it ;)

Upvotes: 2

Related Questions