Reputation: 1001
I am trying to clean a string so that it contains no punctuation or digits; it must contain only a-z and A-Z. For example, the given string is:
"coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
The required output is:
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
My solution is:
re.findall(r"([A-Za-z]+)", string)
My output is:
['coMPuter', 'scien', 'tist', 's', 'are', 'the', 'rock', 'stars', 'of', 'tomorrow', 'cool']
Upvotes: 4
Views: 1557
Reputation: 2120
Using re, although I'm not sure this is what you want, because you said you didn't want "cool" left over.
import re
s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
REGEX = r'([^a-zA-Z\s]+)'
cleaned = re.sub(REGEX, '', s).split()
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']
EDIT
WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'([^a-zA-Z])')
def cleaned(match_obj):
    return re.sub(CLEAN_REGEX, '', match_obj.group(1)).lower()
[cleaned(x) for x in re.finditer(WORD_REGEX, s)]
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
WORD_REGEX uses a positive lookahead for a word character and a negative lookahead for <...>. Whatever non-whitespace makes it past the lookaheads is captured in a group:
(?!<?\S+>) # negative lookahead
(?=\w) # positive lookahead
(\S+) # group non-whitespace
cleaned takes each match, removes every non-letter character from its group with CLEAN_REGEX, and lower-cases the result.
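To see what each stage contributes, here is a small sketch (written for Python 3, reusing the patterns above; the names tokens and words are only for illustration) that prints the raw tokens WORD_REGEX lets through before any cleaning happens:

```python
import re

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'[^a-zA-Z]')

# Stage 1: tokens that start with a word character and are not part of <...>
tokens = [m.group(1) for m in WORD_REGEX.finditer(s)]
print(tokens)
# ['coMPuter', 'scien_tist-s', 'are,,,', 'the', 'rock__stars', 'of', 'tomorrow_']

# Stage 2: strip everything except letters, then lower-case
words = [CLEAN_REGEX.sub('', t).lower() for t in tokens]
print(words)
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```

One thing to watch for: a token made only of digits or underscores, such as 123, passes the lookaheads and is then cleaned down to an empty string, so you may want to drop empty results.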
Upvotes: 1
Reputation: 1001
Following the recommendations of everyone who answered, I arrived at the solution I wanted. Thanks to everyone.
import re
s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
cleaned = re.sub(r'(<.*>|[^a-zA-Z\s]+)', '', s).split()
print(cleaned)
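One caveat with the pattern above: <.*> is greedy, so if the string contains more than one <...> group, everything between the first < and the last > is removed. A possible variant, assuming a tag never contains a nested >, uses <[^>]*> instead and also lower-cases the result to match the required output (the function name clean_words is made up for this sketch):

```python
import re

def clean_words(text):
    # Remove each <...> group on its own, then every remaining run of
    # characters that are neither letters nor whitespace.
    stripped = re.sub(r'<[^>]*>|[^a-zA-Z\s]+', '', text)
    return stripped.lower().split()

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
print(clean_words(s))
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

# Two tags: the word between them survives.
print(clean_words("one <x> two <y> three"))
# ['one', 'two', 'three']
```

With the greedy <.*> version, the second call would return ['one', 'three'], because the match runs from <x> through <y>.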
Upvotes: 3
Reputation: 368904
You don't need a regular expression:
Convert the string to lower case (if you want all-lowercase words), split it into words, keep only the words that start with a letter, and strip the non-alphabetic characters from each one:
>>> s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
>>> [filter(str.isalpha, word) for word in s.lower().split() if word[0].isalpha()]
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
In Python 3.x, filter(str.isalpha, word) should be replaced with ''.join(filter(str.isalpha, word)), because in Python 3.x filter returns a filter object rather than a string.
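Putting the Python 3 note together, the same approach might look like this as a small function (the name letters_only_words is hypothetical, chosen just for this sketch):

```python
def letters_only_words(text):
    # Keep whitespace-separated tokens whose first character is a letter,
    # then drop every non-letter character inside each kept token.
    return [
        ''.join(filter(str.isalpha, word))
        for word in text.lower().split()
        if word[0].isalpha()
    ]

s = "coMPuter scien_tist-s are,,, the rock__stars of tomorrow_ <cool> ????"
print(letters_only_words(s))
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```

Note that in Python 3 str.isalpha is Unicode-aware, so accented and other non-ASCII letters are kept as well; if strictly a-z is required, an extra ASCII check would be needed.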
Upvotes: 5