Reputation: 21
I have a list of words and I want to remove all special characters and numbers. Here is what I came up with:
INPUT: #convert all words to lowercase
words = [word.lower() for word in words]
print(words[:100])
OUTPUT:
['rt', '@', 'dark', 'money', 'has', 'played', 'a', 'significant', 'role', 'in', 'the', 'overall', 'increase', 'of', 'election', 'spending', 'in', 'state', 'judicial', 'elections.', 'https://e85zq', 'rt', '@', 'notice,', 'women,', 'how', 'you', 'are', 'always', 'the', 'target', 'of', 'democrats’', 'fear', 'mongering', 'in', 'an', 'election', 'year', 'or', 'scotus', 'confirmation.', 'it', 'is', 'not', 'because', 'our', 'rights', 'are', 'actually', 'at', 'risk.', 'it', 'is', 'because', 'we', 'are', 'easily', 'manipulated.', 'goes', 'allll', 'the', 'way', 'back', 'to', 'eve.', 'resist', 'hysteria', '&', 'think.', 'rt', '@', 'oct', '5:', 'last', 'day', 'to', 'register', 'to', 'vote.', 'oct', '13:', 'early', 'voting', 'starts.', 'oct', '23:', 'last', 'day', 'to', 'request', 'a', 'mail-in', 'ballot.', 'nov', '3:', 'election', 'day', 'rt', '@']
INPUT
words_cleaned = [re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", i) for i in words]
print(words_cleaned[:100])
OUTPUT
I end up with an empty list: []
What I need is for tokens like '@' to be removed, and for a token like '@test' to turn into 'test'. Any ideas?
Upvotes: 0
Views: 323
Reputation: 23089
You can use built-in shortcuts rather than having to specify all of the special characters. Here's a way to remove everything but letters (see the update at the end for why it is letters rather than "word characters"):
import re
inp = ['rt', '@', 'dark', 'money', 'has', 'played', 'a', '#significant', 'role', 'in', 'tRhe', 'overall', 'increase', 'of', 'election', 'spending', 'in', 'state', 'judicial', 'elections.', 'https://e85zq', 'rt', '@', 'notice,', 'women,', 'how', 'you', 'are', 'always', 'the', 'target', 'of', 'democrats’', 'fear', 'mongering', 'in', 'an', 'election', 'year', 'or', 'scotus', 'confirmation.', 'it', 'is', 'not', 'because', 'our', 'rights', 'are', 'actually', 'at', 'risk.', 'it', 'is', 'because', 'we', 'are', 'easily', 'manipulated.', 'goes', 'allll', 'the', 'way', 'back', 'to', 'eve.', 'resist', 'hysteria', '&', 'think.', 'rt', '@', 'oct', '5:', 'last', 'day', 'to', 'register', 'to', 'vote.', 'oct', '13:', 'early', 'voting', 'starts.', 'oct', '23:', 'last', 'day', 'to', 'request', 'a', 'mail-in', 'ballot.', 'nov', '3:', 'election', 'day', 'rt', '@']
outp = [re.sub(r"[^A-Za-z]+", '', s) for s in inp]
print(outp)
Result:
['rt', '', 'dark', 'money', 'has', 'played', 'a', 'significant', 'role', 'in', 'tRhe', 'overall', 'increase', 'of', 'election', 'spending', 'in', 'state', 'judicial', 'elections', 'httpsezq', 'rt', '', 'notice', 'women', 'how', 'you', 'are', 'always', 'the', 'target', 'of', 'democrats', 'fear', 'mongering', 'in', 'an', 'election', 'year', 'or', 'scotus', 'confirmation', 'it', 'is', 'not', 'because', 'our', 'rights', 'are', 'actually', 'at', 'risk', 'it', 'is', 'because', 'we', 'are', 'easily', 'manipulated', 'goes', 'allll', 'the', 'way', 'back', 'to', 'eve', 'resist', 'hysteria', '', 'think', 'rt', '', 'oct', '', 'last', 'day', 'to', 'register', 'to', 'vote', 'oct', '', 'early', 'voting', 'starts', 'oct', '', 'last', 'day', 'to', 'request', 'a', 'mailin', 'ballot', 'nov', '', 'election', 'day', 'rt', '']
The ^ character here means match everything NOT mentioned in the set of characters that follow inside a [] pair. \w means "word characters" (letters, digits, and the underscore). So a set like [^\w] says "match everything but word characters." The nice thing about using a regular expression is that you can get arbitrarily precise as to just which characters you want to include or exclude.
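To make the difference between the two character sets concrete, here is a small sketch. The sample strings are made up for illustration and are not from the question's data; \W is the standard shorthand for [^\w].

```python
import re

# Hypothetical sample tokens, chosen to show where the two patterns differ.
samples = ["@test", "mail-in", "13:", "snake_case", "democrats\u2019"]

# \W is shorthand for [^\w]: anything that is not a letter, digit, or underscore.
non_word_removed = [re.sub(r"\W+", "", s) for s in samples]

# [^A-Za-z] is stricter: it also removes digits and underscores.
letters_only = [re.sub(r"[^A-Za-z]+", "", s) for s in samples]

print(non_word_removed)
print(letters_only)
```

So if digits should go too, the explicit letter range is the one to use; \W on its own would keep '13'.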
No need to slice the result with [:100] to print it. Just print it as is, like I do. I assume that by using 100, you're wanting to make sure you go to the end of the list, but [:100] only keeps the first 100 items. The better way to do that is to just leave that component blank. So [:] means "take a slice of the list that is the full list", and [5:] means "take from the 6th item to the end of the list".
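A quick sketch of how those slices behave, using a short made-up list rather than the full data:

```python
# Short illustrative list, not the question's full data.
words = ["rt", "@", "dark", "money", "has", "played", "a"]

first_three = words[:3]   # items at indices 0, 1, 2
everything = words[:]     # a shallow copy of the whole list
from_sixth = words[5:]    # from the 6th item (index 5) to the end
past_the_end = words[:100]  # a stop beyond the end is fine; the slice just ends at the last item

print(first_three)
print(from_sixth)
print(len(past_the_end))
```

Note that a stop value larger than the list does not raise an error, which is why [:100] appears to work on short lists while silently truncating longer ones.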
UPDATE: I just noticed that you said you don't want numbers in the result. So then I guess you just want letters. I changed the expression to do that. This is what's nice about a regular expression: you can tweak just what gets replaced without adding calls, loops, etc. You just change a string value.
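One thing the substitution alone does not do is drop tokens like '@' entirely; they become empty strings. If the goal is to remove them from the list, a second filtering step is one option. This is only a sketch on a few made-up tokens, and the name cleaned is my own:

```python
import re

# A handful of illustrative tokens.
words = ["rt", "@", "@test", "5:", "mail-in", "&", "vote."]

# Strip everything except ASCII letters (lazily, via a generator)...
stripped = (re.sub(r"[^A-Za-z]+", "", w) for w in words)

# ...then keep only the tokens that still have something left.
cleaned = [w for w in stripped if w]

print(cleaned)  # ['rt', 'test', 'mailin', 'vote']
```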
Upvotes: 2
Reputation: 9494
If you want to remove all non-letter characters, try:
words = ["".join(filter(lambda c: c.isalpha(), word)) for word in words]
print(words)
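Worth knowing when choosing between this and a regex with [A-Za-z]: str.isalpha() is true for any Unicode letter, not just ASCII, so accented letters survive. A small sketch with made-up tokens (the variable names are just for illustration):

```python
import re

# Illustrative tokens; escapes are used so the accented letters are unambiguous.
words = ["@test", "caf\u00e9", "13:", "democrats\u2019", "na\u00efve!"]

# str.isalpha keeps every Unicode letter and drops digits and punctuation.
unicode_letters = ["".join(filter(str.isalpha, w)) for w in words]

# The ASCII-only regex also drops the accented letters.
ascii_letters = [re.sub(r"[^A-Za-z]+", "", w) for w in words]

# Either way, tokens with no letters end up empty and can be filtered out.
cleaned = [w for w in unicode_letters if w]

print(unicode_letters)
print(ascii_letters)
print(cleaned)
```

Which behaviour is preferable depends on whether the text is expected to contain non-English words.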
Upvotes: 3