Pepe

Reputation: 484

Python regex to remove alphanumeric characters without removing words at the end of the string

I'm trying to clean some text by removing alphanumeric codes from the end of each string, but my regex also removes ordinary words, as shown in the output below. Can someone help me achieve the expected result?

re.sub(r'[a-zA-Z0-9/]{5,}$', '', text)

Input:

asus zenfone 3s max zc521tl
asus zenfone max plus (m1) zb570tl
asus zenfone max pro (m1) zb601kl/zb602k
nokia 3.1 c
nokia 3
asus zenfone 3 zoom ze553k
asus zenfone 3 deluxe zs570kl
blackberry keyone
htc explorer
lg tribute
acer liquid z520

Output:

asus zenfone 3s max 
asus zenfone max plus (m1) 
asus zenfone max pro (m1) 
nokia 3.1 c
nokia 3
asus zenfone 3 zoom 
asus zenfone 3 deluxe 
blackberry 
htc 
lg 
acer liquid z520

Expected output:

asus zenfone 3s max
asus zenfone max plus (m1) 
asus zenfone max pro (m1)
nokia 3.1 c
nokia 3
asus zenfone 3 zoom 
asus zenfone 3 deluxe 
**blackberry keyone**
**htc explorer**
**lg tribute**
acer liquid z520

Upvotes: 1

Views: 1324

Answers (2)

The fourth bird

Reputation: 163362

If the part to remove is always the last word in the string and the string always contains more than one word, you might use:

[ \t]+(?=[a-zA-Z0-9/]{5})[a-zA-Z/]*[0-9][a-zA-Z0-9/]*[A-Za-z]$
  • [ \t]+ Match 1+ spaces or tabs
  • (?=[a-zA-Z0-9/]{5}) Positive lookahead: assert that the next 5 chars are all in the listed class, so the word has at least 5 chars
  • [a-zA-Z/]* Match 0+ letters or slashes
  • [0-9] Match a digit, so the word must contain at least one
  • [a-zA-Z0-9/]* Match 0+ times any char in the character class
  • [A-Za-z] Match a single letter a-z or A-Z, so the word must end with a letter
  • $ End of string

Use an empty string as the replacement.
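
As a rough sketch of how this pattern could be applied in Python (the helper name `strip_model_code` and the short sample list are illustrative and not part of the answer above):

```python
import re

# Pattern quoted from the answer: whitespace, then a final token of 5+
# allowed characters that contains a digit and ends with a letter.
PATTERN = re.compile(
    r'[ \t]+(?=[a-zA-Z0-9/]{5})[a-zA-Z/]*[0-9][a-zA-Z0-9/]*[A-Za-z]$'
)


def strip_model_code(text):
    """Hypothetical helper: drop a trailing model code, if there is one."""
    return PATTERN.sub('', text)


# A few of the strings from the question.
for sample in ['asus zenfone 3s max zc521tl',
               'blackberry keyone',
               'acer liquid z520']:
    print(repr(strip_model_code(sample)))
```

Because the pattern begins with `[ \t]+`, the whitespace in front of the removed code is consumed as well, so no trailing space is left behind. A final word that ends in a digit, or a string that consists of a single word, is left untouched by this pattern.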

Upvotes: 1

Daniel Jonsson

Reputation: 3893

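You can add a positive lookahead, (?=\D*\d), to the regex so that the word at the end is removed only if it contains at least one digit. Because the rest of the pattern must match through to the end of the string, the digit that the lookahead finds is necessarily inside that final word. This prevents the regex from removing ordinary words that contain no digits.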

The complete program:

#!/usr/bin/env python3
import re

texts = [
    'asus zenfone 3s max zc521tl',
    'asus zenfone max plus (m1) zb570tl',
    'asus zenfone max pro (m1) zb601kl/zb602k',
    'nokia 3.1 c',
    'nokia 3',
    'asus zenfone 3 zoom ze553k',
    'asus zenfone 3 deluxe zs570kl',
    'blackberry keyone',
    'htc explorer',
    'lg tribute',
    'acer liquid z520',
]

# Remove the last word only if it has 5+ allowed characters and contains a digit.
for text in texts:
    print(re.sub(r'(?=\D*\d)[a-zA-Z0-9/]{5,}$', '', text))

It outputs:

asus zenfone 3s max 
asus zenfone max plus (m1) 
asus zenfone max pro (m1) 
nokia 3.1 c
nokia 3
asus zenfone 3 zoom 
asus zenfone 3 deluxe 
blackberry keyone
htc explorer
lg tribute
acer liquid z520
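
The output above keeps the space that preceded each removed code. If that trailing whitespace is unwanted, one possible variation is to strip it after the substitution. This is only a sketch: the helper name `remove_trailing_code` is made up for illustration, and calling `rstrip()` is an addition to the program above, not part of it.

```python
import re

# Same pattern as in the program above.
TRAILING_CODE = re.compile(r'(?=\D*\d)[a-zA-Z0-9/]{5,}$')


def remove_trailing_code(text):
    """Hypothetical helper: remove a trailing code, then any leftover whitespace."""
    return TRAILING_CODE.sub('', text).rstrip()


for sample in ['asus zenfone max plus (m1) zb570tl',
               'htc explorer',
               'acer liquid z520']:
    print(repr(remove_trailing_code(sample)))
```

Note that `rstrip()` also removes trailing whitespace from strings where no code was matched, which may or may not be what you want.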

Upvotes: 1
