Reputation: 151
(\w{1,4})(?:\s{0,1})(\d{1,4})(?:\s{0,1})(\w{1,4})\s
Apologies if this is really ugly regex but I am not fluent in it at all.
I need a regex to extract all possible combinations from motorcycle names. For instance:
From a Honda CBR500R I would need to get CBR, 500 and R. I am not sure if a regex could also give me CBR500 and 500R, but that would be really sweet!
Some type of bike names:
Honda CBR500R
CBR 500 R
CBR 500R
CBR500 R
GS1000 S
XYZT 1000P
500ztx
KLR250 Honda
FZR 600 Suzuki
SV650
Text here XXXX 9999 XXXX 9999 XXXXX more text here
Is there a way to improve my regex, making it simpler and smarter?
Upvotes: 1
Views: 136
Reputation: 627044
You can use
([A-Z]{2,})?[\s-]*(\d+)([a-z]+)?[\s-]*([A-Z]*\b)
See the regex demo
The regex matches:
([A-Z]{2,})? - Group 1: an optional sequence of 2 or more ASCII uppercase letters
[\s-]* - zero or more hyphens or whitespace symbols
(\d+) - Group 2: one or more digits
([a-z]+)? - Group 3: an optional sequence of one or more ASCII lowercase letters
[\s-]* - zero or more hyphens or whitespace symbols
([A-Z]*\b) - Group 4: zero or more ASCII uppercase letters followed by a word boundary.

Here is sample extraction code in Python:
import re

p = re.compile(r'([A-Z]{2,})?[\s-]*(\d+)([a-z]+)?[\s-]*([A-Z]*\b)')
test_str = "Honda CBR500R\nCBR 500 R\nCBR 500R\nCBR500 R\nGS1000 S\nXYZT 1000P\n500ztx\nKLR250 Honda\nFZR 600 Suzuki\nText here XXXX 9999 XXXX 9999 XXXXX more text here"
for s in p.findall(test_str):
    print("New Entry:")
    for r in s:
        if r:  # skip groups that did not match anything
            print(r)
Output:
New Entry:
CBR
500
R
New Entry:
CBR
500
R
New Entry:
CBR
500
R
New Entry:
CBR
500
R
New Entry:
GS
1000
S
New Entry:
XYZT
1000
P
New Entry:
500
ztx
New Entry:
KLR
250
New Entry:
FZR
600
New Entry:
XXXX
9999
XXXX
New Entry:
9999
XXXXX
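If you also need the joined forms mentioned in the question (CBR500 and 500R), one option is to build them from the captured groups rather than asking the regex for overlapping matches. This is only a rough sketch, assuming the same pattern as above; the combos helper is a name made up for illustration, and it only looks at the first match in each string:

```python
import re

# Same pattern as in the answer above.
p = re.compile(r'([A-Z]{2,})?[\s-]*(\d+)([a-z]+)?[\s-]*([A-Z]*\b)')

def combos(name):
    """Hypothetical helper: parts of the first match plus the joined forms."""
    m = p.search(name)
    if not m:
        return []
    # Groups that did not participate come back as '' instead of None.
    prefix, digits, lower, suffix = m.groups(default='')
    # Use whichever trailing letters were captured (lowercase ones first).
    tail = lower or suffix
    parts = [prefix, digits, tail]
    if prefix:
        parts.append(prefix + digits)   # e.g. CBR500
    if tail:
        parts.append(digits + tail)     # e.g. 500R
    return [x for x in parts if x]

print(combos("Honda CBR500R"))  # ['CBR', '500', 'R', 'CBR500', '500R']
```

Joining the groups in Python keeps the regex itself unchanged, which seemed simpler than trying to express the overlapping pieces in a single pattern.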
Upvotes: 1
Reputation: 4504
I came up with the following pattern. Not sure if it is what you expected (duplicates are not removed):
import re
txt = """
Honda CBR500R
CBR 500 R
CBR 500R
CBR500 R
GS1000 S
XYZT 1000P
500ztx
KLR250 Honda
FZR 600 Suzuki
SV650
Text here XXXX 9999 XXXX 9999 XXXXX more text here
"""
pattern = r'[A-Z]+\d+|\d+[A-Z]|[A-Z]+(?![a-z])|\d+[a-z]+|\d+'
print(re.findall(pattern, txt))
Output is:
['CBR500', 'R', 'CBR', '500', 'R', 'CBR', '500R', 'CBR500', 'R', 'GS1000', 'S', 'XYZT', '1000P', '500ztx', 'KLR250', 'FZR', '600', 'SV650', 'XXXX', '9999', 'XXXX', '9999', 'XXXXX']
If you want to capture '500R' from 'CBR500R' also:
p1 = r'[A-Z]+\d+|(?<!\d)[A-Z]+(?![a-z])|\d+[a-z]+|\d+(?![0-9A-Z])'
p2 = r'\d+[A-Z]'
print(re.findall(p1, txt) + re.findall(p2, txt))
Output is:
['CBR500', 'CBR', '500', 'R', 'CBR', 'CBR500', 'R', 'GS1000', 'S', 'XYZT', '500ztx', 'KLR250', 'FZR', '600', 'SV650', 'XXXX', '9999', 'XXXX', '9999', 'XXXXX', '500R', '500R', '1000P']
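Since the duplicates are not removed, here is one possible way to drop them while keeping the order in which tokens first appear. It is a minimal sketch, assuming Python 3.7+ (where dicts preserve insertion order); unique_tokens is a hypothetical name, not part of the code above:

```python
import re

# The two patterns from the answer above.
p1 = r'[A-Z]+\d+|(?<!\d)[A-Z]+(?![a-z])|\d+[a-z]+|\d+(?![0-9A-Z])'
p2 = r'\d+[A-Z]'

def unique_tokens(text):
    """Hypothetical helper: all matches of p1 and p2, duplicates removed."""
    tokens = re.findall(p1, text) + re.findall(p2, text)
    # dict.fromkeys keeps the first occurrence of each token, in order.
    return list(dict.fromkeys(tokens))

print(unique_tokens("Honda CBR500R\nCBR 500 R\nCBR 500R"))
# ['CBR500', 'CBR', '500', 'R', '500R']
```

If you need the tokens per bike name rather than for the whole text, you could call the helper on each line separately.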
Upvotes: 1