Reputation: 110382
I have the following identifiers:
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
I need a regex to get me the following output:
id1 = '883316040119'
id2 = 'ZWEX01DE9463DB'
id3 = '35358'
id4 = 'as3d99j'
Here is what I have so far --
re.sub(r'_?([a-zA-Z]{2,4}?\d?(00\d)?)$','',vendor_id)
It doesn't work perfectly though, here is what it gives me:
BAD - 883316040119_FRIENDS
GOOD - ZWEX01DE9463DB
GOOD - 35358
GOOD - as3d99j
What would be the correct regular expression to get all of them? For the first one, I basically want to strip the ending if it is only underscores and letters, so 1928h9829_bundle_hd --> 1928h9829.
Please note that I have hundreds of thousands of identifiers here, and it is required that I use a regular expression. I'm not looking for a Python split() way to do it, as it wouldn't work.
Upvotes: 0
Views: 129
Reputation: 4521
This is not a subtraction approach: instead of stripping the suffix, just capture the part you want to keep.
The regex is (^[0-9]+)|(^[a-zA-Z0-9]+(?=_)) (roughly (^\d+)|(^[\d\w]+(?=_)), although \w also matches underscores).
import re

id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
ids = [id1, id2, id3, id4]
for i in ids:
    # re.match returns None when neither alternative matches
    m = re.match(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))", i)
    print(m.group() if m else "not matched")
output:
883316040119
ZWEX01DE9463DB
35358
as3d99j
Upvotes: 0
Reputation: 657
This works for the examples:
for vendor_id in ids:
    print(vendor_id)
883316040119_FRIENDS_HD
ZWEX01DE9463DB_DMD
35358fr1
as3d99j_br001
for vendor_id in ids:
    # raw string so that \d reaches the regex engine unchanged
    hit = re.sub(r"(_[A-Za-z_]*|_?[A-Za-z]{2,4}?\d?(00\d)?)$", "", vendor_id)
    print(hit)
883316040119
ZWEX01DE9463DB
35358
as3d99j
When the tail consists only of letters and underscores, the first alternative is easygoing and strips off any number of underscores and letters. If the tail does not contain an underscore, or contains digits after the underscore, the second alternative demands the pattern from the question: an optional underscore, 2 to 4 letters, then an optional digit, then an optional zero-zero-digit.
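A quick way to check the substitution above is to run it over the sample identifiers plus the extra case from the question. This is only a sketch: the helper name strip_suffix and the expected table are made up for illustration, and the pattern is compiled once simply because the question mentions hundreds of thousands of identifiers.

```python
import re

# Pattern from the answer above, written as a raw string.
SUFFIX = re.compile(r"(_[A-Za-z_]*|_?[A-Za-z]{2,4}?\d?(00\d)?)$")


def strip_suffix(vendor_id):
    """Hypothetical helper: remove the trailing tag from one identifier."""
    return SUFFIX.sub("", vendor_id)


# Inputs and desired outputs, all taken from the question.
expected = {
    "883316040119_FRIENDS_HD": "883316040119",
    "ZWEX01DE9463DB_DMD": "ZWEX01DE9463DB",
    "35358fr1": "35358",
    "as3d99j_br001": "as3d99j",
    "1928h9829_bundle_hd": "1928h9829",
}

for raw, wanted in expected.items():
    got = strip_suffix(raw)
    status = "ok" if got == wanted else "MISMATCH"
    print(raw, "->", got, status)
```

Identifiers whose tails fall outside these five shapes are not covered by this check, so it is worth adding a few real ones to the table before trusting the pattern on the full data set.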
Upvotes: 1
Reputation: 41838
The way you present your input, I would suggest this simple regex:
^(?:[^_]+(?=_)|\d+)
This can be tweaked if you want to add details to the spec.
To show you a regex demo, just because of the way the site regex101 works, we have to add \n (it assumes we are working on the whole file, rather than one input at a time): DEMO
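For one identifier at a time, the same pattern can be tried directly in Python. The sketch below assumes a made-up helper name, leading_id, and uses re.match so that the regex is applied from the start of each string. The input 'abc12' is a hypothetical extra case, not from the question, added to show what happens when neither alternative applies.

```python
import re

# Pattern from this answer: non-underscore characters up to an underscore,
# or else a leading run of digits.
PREFIX = re.compile(r"^(?:[^_]+(?=_)|\d+)")


def leading_id(vendor_id):
    """Hypothetical helper: return the matched prefix, or None if no match."""
    m = PREFIX.match(vendor_id)
    return m.group() if m else None


samples = [
    "883316040119_FRIENDS_HD",
    "ZWEX01DE9463DB_DMD",
    "35358fr1",
    "as3d99j_br001",
    "abc12",  # hypothetical: no underscore and no leading digit
]

for s in samples:
    print(s, "->", leading_id(s))
```

The last sample returns None, which is one of the places where the spec would need tweaking if such identifiers exist in the data.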
Explanation
^ anchor asserts that we are at the beginning of the string
(?: ... ) matches either
[^_]+(?=_) non-underscore characters (followed by an underscore, not matched)
| OR
\d+ digits
Upvotes: 2
Reputation: 60344
The following reproduces your desired results from your input.
I would use the replace method with this regex:
_[^']+|(?!.*_)('[0-9]+)[^']+
and return capturing group 1
Perhaps:
result = re.sub("_[^']+|(?!.*_)('[0-9]+)[^']+", r"\1", subject)
The regex first looks for an underscore. If it finds one, it will match everything up to but not including the next single quote; and that will get removed.
If that doesn't match, the alternative will look for a string that does NOT have an underscore; match and return in capturing group 1 the sequence of digits; and then replace everything after the digits up to but not including the single quote.
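Note that this pattern relies on the single quotes around each identifier, so it is meant to run over the text as listed in the question rather than over bare strings. A minimal sketch under that assumption follows; the variable names are illustrative. It also depends on re.sub replacing an unmatched group with an empty string, which is the documented behaviour from Python 3.5 onward.

```python
import re

PATTERN = re.compile(r"_[^']+|(?!.*_)('[0-9]+)[^']+")

# The identifiers as written in the question, quotes included.
subject = "\n".join([
    "id1 = '883316040119_FRIENDS_HD'",
    "id2 = 'ZWEX01DE9463DB_DMD'",
    "id3 = '35358fr1'",
    "id4 = 'as3d99j_br001'",
])

# When the first alternative matches, group 1 is unset and the match is
# simply removed; when the second matches, group 1 (quote plus digits) is kept.
result = PATTERN.sub(r"\1", subject)
print(result)
```

A bare string such as 35358fr1 with no surrounding quotes would be left unchanged, because the second alternative needs the opening quote to match.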
Upvotes: 0
Reputation: 740
You are checking for an underscore only once, since ? means {0,1}. To allow several suffix segments, repeat the group:
r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$'
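To see the difference, the pattern from the question and the one above can be run side by side on the identifier that failed. This is just a sketch; the names ORIGINAL and REPEATED are made up for the comparison.

```python
import re

# Pattern from the question: at most one underscore-prefixed segment.
ORIGINAL = re.compile(r"_?([a-zA-Z]{2,4}?\d?(00\d)?)$")
# Pattern from this answer: the whole group may repeat.
REPEATED = re.compile(r"(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$")

vendor_id = "883316040119_FRIENDS_HD"
print(ORIGINAL.sub("", vendor_id))  # only "_HD" is removed
print(REPEATED.sub("", vendor_id))  # "_FRIENDS" and "_HD" are both removed

# The remaining samples from the question.
for other in ["ZWEX01DE9463DB_DMD", "35358fr1", "as3d99j_br001"]:
    print(REPEATED.sub("", other))
```

The second alternative only accepts lowercase letters followed by a digit, so a tail like FR1 without an underscore would not be stripped by this version.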
Upvotes: 0