N_B

Reputation: 311

Alternative approach to strip symbols in a string

I am working on a function that keeps symbols that are inside a word (a word can consist of a-zA-Z, 0-9 and _) but removes every other symbol outside the word.

For example: 
Input String - hell_o ? my name _ i's <hel'lo/>
Output - ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]

The function I am using:

import re
from string import punctuation

l = ' '.join(filter(None, (word.strip(punctuation.replace("_", "")) for word in input_String.split())))
l = re.sub(r'\s+', " ", l)
t = l.lower().split()

I know this is not the best or most efficient way. Can anyone recommend alternatives that I can try? Perhaps a regex to do this?

Upvotes: 0

Views: 259

Answers (1)

antoni

Reputation: 5546

You can match any character other than a-zA-Z, 0-9 and _, as you mention, between two letters with (?<=[a-z])\W(?=[a-z]) and replace it with nothing to remove it.

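In the end this is a risky algorithm. For instance, in the sentence I'm fine.And you? there is no space after the dot, so with the i flag the dot is removed and the words are fused into fineAnd, which may not be what you want. Note that \W also matches apostrophes and spaces, so those are removed as well whenever they sit directly between two letters.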
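To make the risk concrete, here is a minimal Python sketch of that substitution. The function name strip_between_letters is only an illustrative choice, and the pattern is compiled with re.IGNORECASE on the assumption that the i flag is wanted:

```python
import re

# Pattern from above: one non-word character with a letter on each side.
# re.IGNORECASE is an assumption, equivalent to the "i" flag on regex101.
BETWEEN_LETTERS = re.compile(r"(?<=[a-z])\W(?=[a-z])", re.IGNORECASE)


def strip_between_letters(text):
    """Remove every non-word character that sits directly between two letters."""
    return BETWEEN_LETTERS.sub("", text)


print(strip_between_letters("I'm fine.And you?"))  # ImfineAndyou?
```

The apostrophe, the dot and both spaces each sit between two letters, so all four are removed. Only the trailing ? survives, because nothing follows it.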


[EDIT] after your comments.

OK, I misunderstood your question.

Here is a single regex that selects 'hell_o', 'my', 'name', "i's", "hel'lo":

(?<![a-z])[a-z][^\s]*[a-z](?![a-z])

You can see it working here: https://regex101.com/r/EAEelq/3 (don't forget the i and g flags).


[EDIT] As you also want to match the _ outside a word

OK, so if you also want a standalone underscore to be matched, update the regex as follows: (?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )

See it working here: https://regex101.com/r/EAEelq/4
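
For use in Python, a minimal sketch could look like the following. The function name extract_words is just an illustrative choice; re.findall plays the role of the g flag, re.IGNORECASE the role of the i flag, and the lowercasing mirrors the code in the question:

```python
import re

# The updated pattern from above, compiled case-insensitively ("i" flag).
TOKEN = re.compile(
    r"(?<![a-z_])[a-z_][^\s]*[a-z_](?![a-z_])|(?<= )[a-z_](?= )",
    re.IGNORECASE,
)


def extract_words(text):
    """Return all matches of the pattern in the lowercased text."""
    # findall returns every non-overlapping match, like the "g" flag.
    return TOKEN.findall(text.lower())


print(extract_words("hell_o ? my name _ i's <hel'lo/>"))
# ['hell_o', 'my', 'name', '_', "i's", "hel'lo"]
```

Keep in mind that the character classes only cover letters and the underscore, so a word ending in a digit is cut short (abc123 gives abc). Adding 0-9 to each class would be one way to handle that if you need it.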

Upvotes: 1
