Roy

Reputation: 159

How to keep specific words when preprocessing text for NLP? (str.replace & regex)

I want to remove all digits except those in the word '3d'. I've tried a few approaches, but they failed. Please look at my simple code below:


import re

s = 'd3 4 3d'
rep_ls = re.findall(r'([0-9]+[a-zA-Z]*)', s)

>> ['3', '4', '3d']

for n in rep_ls:
    if n == '3d':
        continue
    s = s.replace(n, '')

>> s = 'd  d'
>> expected = 'd 3d'

Upvotes: 0

Views: 157

Answers (3)

The fourth bird

Reputation: 163632

To remove all digits except those in the word 3d, you could use a negative lookahead, (?!...), to assert that what is directly to the right is not 3d between word boundaries (\b).

Then match one or more digits with \d+.

In the replacement, use an empty string.

(?!\b3d\b)\d+
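
As a minimal sketch of how this pattern could be applied in Python: the variable names and the whitespace clean-up step are assumptions added here, not part of the question. Removing the standalone 4 leaves two adjacent spaces, so the sketch collapses runs of whitespace afterwards.

```python
import re

s = 'd3 4 3d'

# Remove every run of digits unless it begins the standalone word "3d".
without_digits = re.sub(r'(?!\b3d\b)\d+', '', s)

# Deleting "4" leaves two adjacent spaces; collapse whitespace runs.
result = ' '.join(without_digits.split())

print(repr(without_digits))
print(repr(result))
```

If the original spacing matters for later processing, the second step can simply be skipped.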


Upvotes: 2

Emma

Reputation: 27743

Maybe this expression,

(?i)(3d)\b|(\D+)|\d+

might work with re.sub, using \1\2 as the replacement.


If the uppercase 3D should not be kept (here we assume it should be), then (?i) can be safely removed:

(3d)\b|(\D+)|\d+

Anything else other than 3d that you wish to keep would go in the first capturing group:

(3d|4d|anything_else)\b|(\D+)|\d+
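
One possible way to build that first group from a list of words, sketched under a few assumptions: the helper name remove_digits_except and the keep_words parameter are hypothetical, re.escape is used so each word is matched literally, and the final whitespace collapse is an extra step not present in the pattern above. It relies on Python 3.5+, where unmatched groups in the replacement are substituted with an empty string.

```python
import re

def remove_digits_except(text, keep_words):
    # Hypothetical helper; keep_words is assumed to be a non-empty list.
    # Escape each word so regex metacharacters are matched literally.
    keep = '|'.join(re.escape(w) for w in keep_words)
    pattern = r'(' + keep + r')\b|(\D+)|\d+'
    # Group 1 keeps protected words, group 2 keeps non-digit text,
    # and a bare digit run matches neither group, so it is dropped.
    cleaned = re.sub(pattern, r'\1\2', text)
    # Collapse the whitespace left behind by removed standalone numbers.
    return ' '.join(cleaned.split())

print(remove_digits_except('d3 4 3d 4d x9', ['3d', '4d']))
```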

Test

import re

regex = r'(?i)(3d)\b|(\D+)|\d+'
string = '''d3 4 3d'''

print(re.sub(regex, r'\1\2', string))

Output (two spaces remain where the 4 was removed)

d  3d


Upvotes: 0

Code Maniac

Reputation: 37775

You're very close. You simply need to split the string on whitespace and then loop over the tokens: if a token is 3d, leave it unchanged; otherwise, remove its digits.

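import re

s = 'd3 4 3d'
rep_ls = re.split(r'\s+', s)  # split on runs of whitespace

final = ''
for n in rep_ls:
    if n == '3d':
        # keep the protected word untouched
        final += ' 3d'
        continue
    # strip the digits from every other token
    final += ' ' + re.sub(r'\d+', '', n)

print(final)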

Trim the string at the end to remove the extra leading space, or use an if statement so that no space is added when the index is 0.
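
A possible sketch of that trimming step, assuming the same loop as above. The names trimmed and normalized are introduced here for illustration only. Note that strip() removes only the leading space, while splitting and re-joining also collapses the double space left where the 4 used to be.

```python
import re

s = 'd3 4 3d'

final = ''
for n in re.split(r'\s+', s):
    # Keep '3d' as is; strip the digits from any other token.
    final += ' ' + (n if n == '3d' else re.sub(r'\d+', '', n))

# Option 1: drop only the leading and trailing whitespace.
trimmed = final.strip()

# Option 2: also collapse inner runs of whitespace.
normalized = ' '.join(final.split())

print(repr(trimmed))
print(repr(normalized))
```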


Or you can collect the tokens in a list and join them later

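import re

s = 'd3 4 3d'
rep_ls = re.split(r'\s+', s)  # split on runs of whitespace

final = []
for n in rep_ls:
    if n != '3d':
        # strip the digits from every token except the protected word
        n = re.sub(r'\d+', '', n)
    if n:
        # skip tokens that consisted of digits only, e.g. '4'
        final.append(n)

final = " ".join(final)
print(final)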

Output is >> d 3d

Upvotes: 1
