David542

Reputation: 110382

Regex expression to strip ending of word

I have the following identifiers:

id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD' 
id3 = '35358fr1'
id4 = 'as3d99j_br001'

I need a regex to get me the following output:

id1 = '883316040119'
id2 = 'ZWEX01DE9463DB' 
id3 = '35358'
id4 = 'as3d99j'

Here is what I have so far:

re.sub(r'_?([a-zA-Z]{2,4}?\d?(00\d)?)$','',vendor_id)

It doesn't work perfectly though, here is what it gives me:

BAD  - 883316040119_FRIENDS
GOOD - ZWEX01DE9463DB
GOOD - 35358
GOOD - as3d99j

What would be the correct regular expression to get all of them? For the first one, I basically want to strip the ending if it is only underscores and letters, so 1928h9829_bundle_hd --> 1928h9829.

Please note that I have hundreds of thousands of identifiers here, and it is required that I use a regular expression. I'm not looking for a python split() way to do it, as it wouldn't work.

Upvotes: 0

Views: 129

Answers (5)

Kei Minagawa

Reputation: 4521

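This is not a subtraction approach: instead of stripping the ending, it captures the part you want to keep.

The regex is (^[0-9]+)|(^[a-zA-Z0-9]+(?=_)), i.e. roughly (^\d+)|(^[\d\w]+(?=_)), except that \w also matches an underscore.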

import re
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD' 
id3 = '35358fr1'
id4 = 'as3d99j_br001'
ids = [id1, id2, id3, id4]

for i in ids:
    # re.match returns None when nothing matches, so check before calling .group()
    m = re.match(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))", i)
    print(m.group() if m else "not matched")

output:

883316040119
ZWEX01DE9463DB
35358
as3d99j

Upvotes: 0

Peter Raynham

Reputation: 657

This works for the examples:

for id in ids:
    print(id)

883316040119_FRIENDS_HD
ZWEX01DE9463DB_DMD
35358fr1
as3d99j_br001

for id in ids:
    # raw string so that \d reaches the regex engine unchanged
    hit = re.sub(r"(_[A-Za-z_]*|_?[A-Za-z]{2,4}?\d?(00\d)?)$", "", id)
    print(hit)

883316040119
ZWEX01DE9463DB
35358
as3d99j

When the tail consists only of letters and underscores, the pattern is easygoing and strips off any number of underscores and letters. If the tail does not contain an underscore, or contains digits after the underscore, it demands the pattern from the question: an optional underscore, two to four letters, then an optional digit, then an optional zero-zero-digit.
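As a rough sketch, the same pattern can be compiled once and wrapped in a helper, which may matter with hundreds of thousands of identifiers. The names TAIL and strip_tail are made up for illustration; the pattern itself is the one shown above, split across two lines only so that each alternative can carry a comment.

```python
import re

# Same pattern as above, split into its two alternatives for readability.
TAIL = re.compile(
    r"(_[A-Za-z_]*"                     # underscore, then only letters/underscores
    r"|_?[A-Za-z]{2,4}?\d?(00\d)?)$"    # the stricter pattern from the question
)

def strip_tail(vendor_id):
    """Hypothetical helper: remove the trailing suffix from one identifier."""
    return TAIL.sub("", vendor_id)

ids = [
    "883316040119_FRIENDS_HD",
    "ZWEX01DE9463DB_DMD",
    "35358fr1",
    "as3d99j_br001",
    "1928h9829_bundle_hd",  # extra example from the question text
]

for vendor_id in ids:
    print(vendor_id, "->", strip_tail(vendor_id))
```

This has only been checked against the identifiers listed here; other suffix shapes may need another alternative.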

Upvotes: 1

zx81

Reputation: 41838

Given the way you present your input, I would suggest this simple regex:

^(?:[^_]+(?=_)|\d+)

This can be tweaked if you want to add details to the spec.

In the regex101 demo, we have to add \n to the pattern, but only because of the way the site works (it assumes we are working on the whole file rather than one input at a time): DEMO

Explanation

  • The ^ anchor asserts that we are at the beginning of the string
  • The non-capture group (?: ... ) matches either
      • [^_]+(?=_) non-underscore characters (followed by an underscore, not matched)
      • | OR
      • \d+ digits
Upvotes: 2

Ron Rosenfeld

Reputation: 60344

The following reproduces your desired results from your input.

I would use the replace method with this regex:

_[^']+|(?!.*_)('[0-9]+)[^']+

and return capturing group 1

Perhaps:

result = re.sub("_[^']+|(?!.*_)('[0-9]+)[^']+", r"\1", subject)

The regex first looks for an underscore. If it finds one, it will match everything up to but not including the next single quote; and that will get removed.

If that doesn't match, the alternative will look for a string that does NOT have an underscore; match the opening single quote and the sequence of digits and return them in capturing group 1; and then drop everything after the digits up to but not including the closing single quote.
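A minimal sketch of how this behaves, assuming each subject string still carries its surrounding single quotes (the pattern relies on them). The names PATTERN and strip_quoted are made up for illustration. Group 1 is unmatched whenever the first alternative fires, and on Python versions before 3.5 the `\1` template raises an error in that case, so the sketch uses a replacement function instead.

```python
import re

PATTERN = re.compile(r"_[^']+|(?!.*_)('[0-9]+)[^']+")

def strip_quoted(subject):
    """Hypothetical helper: subject is expected to include its single quotes."""
    # group(1) is None when the underscore alternative matched, so fall back to "".
    return PATTERN.sub(lambda m: m.group(1) or "", subject)

for subject in ["'883316040119_FRIENDS_HD'", "'ZWEX01DE9463DB_DMD'",
                "'35358fr1'", "'as3d99j_br001'"]:
    print(subject, "->", strip_quoted(subject))
```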

Upvotes: 0

denisvm

Reputation: 740

Your pattern allows for the underscore group at most once, because ? means {0,1}. Repeating the whole suffix group with + lets it strip several underscore-separated parts:

r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$'
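Plugged into re.sub, this might be used as in the sketch below; SUFFIX and strip_suffix are illustrative names, not part of the question. The trailing + lets the group repeat, so a suffix such as _FRIENDS_HD is consumed as two underscore parts in one match.

```python
import re

# One or more suffix parts, anchored at the end of the string.
SUFFIX = re.compile(r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$')

def strip_suffix(vendor_id):
    """Hypothetical helper: remove every trailing suffix part in one pass."""
    return SUFFIX.sub('', vendor_id)

for vendor_id in ['883316040119_FRIENDS_HD', 'ZWEX01DE9463DB_DMD',
                  '35358fr1', 'as3d99j_br001', '1928h9829_bundle_hd']:
    print(vendor_id, '->', strip_suffix(vendor_id))
```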

Upvotes: 0
