Andrew Seaman

Reputation: 21

Python - Remove Special Characters from list

I have a list of words and I want to remove all special characters and numbers. Here is what I came up with:

INPUT: #convert all words to lowercase

words = [word.lower() for word in words]
print(words[:100])

OUTPUT:

['rt', '@', 'dark', 'money', 'has', 'played', 'a', 'significant', 'role', 'in', 'the', 'overall', 'increase', 'of', 'election', 'spending', 'in', 'state', 'judicial', 'elections.', 'https://e85zq', 'rt', '@', 'notice,', 'women,', 'how', 'you', 'are', 'always', 'the', 'target', 'of', 'democrats’', 'fear', 'mongering', 'in', 'an', 'election', 'year', 'or', 'scotus', 'confirmation.', 'it', 'is', 'not', 'because', 'our', 'rights', 'are', 'actually', 'at', 'risk.', 'it', 'is', 'because', 'we', 'are', 'easily', 'manipulated.', 'goes', 'allll', 'the', 'way', 'back', 'to', 'eve.', 'resist', 'hysteria', '&', 'think.', 'rt', '@', 'oct', '5:', 'last', 'day', 'to', 'register', 'to', 'vote.', 'oct', '13:', 'early', 'voting', 'starts.', 'oct', '23:', 'last', 'day', 'to', 'request', 'a', 'mail-in', 'ballot.', 'nov', '3:', 'election', 'day', 'rt', '@']

INPUT

words_cleaned = [re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", i) for i in words]

print(words_cleaned[:100])

OUTPUT

I end up with an empty list: []

What I need is for tokens like '@' to be removed, and for a word like '@test' to become 'test'. Any ideas?

Upvotes: 0

Views: 323

Answers (2)

CryptoFool

Reputation: 23089

You can use a negated character set rather than having to specify all of the special characters. Here's a way to remove everything but letters:

import re

inp = ['rt', '@', 'dark', 'money', 'has', 'played', 'a', '#significant', 'role', 'in', 'tRhe', 'overall', 'increase', 'of', 'election', 'spending', 'in', 'state', 'judicial', 'elections.', 'https://e85zq', 'rt', '@', 'notice,', 'women,', 'how', 'you', 'are', 'always', 'the', 'target', 'of', 'democrats’', 'fear', 'mongering', 'in', 'an', 'election', 'year', 'or', 'scotus', 'confirmation.', 'it', 'is', 'not', 'because', 'our', 'rights', 'are', 'actually', 'at', 'risk.', 'it', 'is', 'because', 'we', 'are', 'easily', 'manipulated.', 'goes', 'allll', 'the', 'way', 'back', 'to', 'eve.', 'resist', 'hysteria', '&amp;', 'think.', 'rt', '@', 'oct', '5:', 'last', 'day', 'to', 'register', 'to', 'vote.', 'oct', '13:', 'early', 'voting', 'starts.', 'oct', '23:', 'last', 'day', 'to', 'request', 'a', 'mail-in', 'ballot.', 'nov', '3:', 'election', 'day', 'rt', '@']

outp = [re.sub(r"[^A-Za-z]+", '', s) for s in inp]

print(outp)

Result:

['rt', '', 'dark', 'money', 'has', 'played', 'a', 'significant', 'role', 'in', 'tRhe', 'overall', 'increase', 'of', 'election', 'spending', 'in', 'state', 'judicial', 'elections', 'httpsezq', 'rt', '', 'notice', 'women', 'how', 'you', 'are', 'always', 'the', 'target', 'of', 'democrats', 'fear', 'mongering', 'in', 'an', 'election', 'year', 'or', 'scotus', 'confirmation', 'it', 'is', 'not', 'because', 'our', 'rights', 'are', 'actually', 'at', 'risk', 'it', 'is', 'because', 'we', 'are', 'easily', 'manipulated', 'goes', 'allll', 'the', 'way', 'back', 'to', 'eve', 'resist', 'hysteria', 'amp', 'think', 'rt', '', 'oct', '', 'last', 'day', 'to', 'register', 'to', 'vote', 'oct', '', 'early', 'voting', 'starts', 'oct', '', 'last', 'day', 'to', 'request', 'a', 'mailin', 'ballot', 'nov', '', 'election', 'day', 'rt', '']

The ^ character at the start of a [] pair means "match any character NOT mentioned in the set that follows". The first version of this answer used \w, which means "word characters" (letters, digits, and underscore), so the whole thing said "match everything but word characters." With A-Za-z in the set instead, it says "match everything but ASCII letters." The nice thing about using a regular expression is that you can get arbitrarily precise as to just which characters you want to include or exclude.
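To make the difference between the two sets concrete, here is a small sketch. The sample strings and variable names are made up for illustration; only the two patterns come from the discussion above.

```python
import re

# Hypothetical sample tokens, chosen to show where the two patterns differ.
samples = ["@test", "mail-in", "13:", "snake_case", "https://e85zq"]

# [^\w]+ removes everything except word characters
# (letters, digits, and underscore).
keep_word_chars = [re.sub(r"[^\w]+", "", s) for s in samples]

# [^A-Za-z]+ removes everything except ASCII letters,
# so digits and underscores are stripped as well.
keep_letters = [re.sub(r"[^A-Za-z]+", "", s) for s in samples]

print(keep_word_chars)  # ['test', 'mailin', '13', 'snake_case', 'httpse85zq']
print(keep_letters)     # ['test', 'mailin', '', 'snakecase', 'httpsezq']
```

Which set is right depends on whether digits and underscores count as "special" for your data.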

There's no need to slice the result with [:100] to print it. Just print it as is, like I do. I assume that by using 100 you wanted to make sure you reach the end of the list, but [:100] returns at most the first 100 elements. The better way to get everything is to leave that component blank. So [:] means "take a slice that is the full list", and [5:] means "take from the 6th element to the end of the list".
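A quick sketch of how those slices behave, using a short made-up list:

```python
# Hypothetical six-element list for illustration.
words = ["rt", "@", "dark", "money", "has", "played"]

first_three = words[:3]     # the first three elements
capped = words[:100]        # a stop past the end is fine: you get the whole list
full_copy = words[:]        # blank start and stop: a copy of the full list
from_third = words[2:]      # from the 3rd element to the end

print(first_three)  # ['rt', '@', 'dark']
print(capped)       # ['rt', '@', 'dark', 'money', 'has', 'played']
print(from_third)   # ['dark', 'money', 'has', 'played']
```

A stop index larger than the list length does not raise an error; the slice just ends at the last element.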

UPDATE: I just noticed that you said you don't want numbers in the result, so I guess you just want letters. I changed the expression to do that. This is what's nice about a regular expression: you can tweak exactly what gets replaced by changing a string value, without adding more calls, loops, etc.
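Note that re.sub turns a token like '@' into an empty string rather than removing it from the list. If you want those entries gone entirely, one option is a second filtering pass. This is a sketch on a made-up input, not part of the code above:

```python
import re

# Hypothetical input; '@' and '5:' contain no letters at all.
inp = ["rt", "@", "@test", "elections.", "5:", "mail-in"]

# Step 1: strip everything that is not an ASCII letter.
cleaned = [re.sub(r"[^A-Za-z]+", "", s) for s in inp]

# Step 2: drop entries that became empty strings
# (an empty string is falsy in Python).
non_empty = [w for w in cleaned if w]

print(cleaned)    # ['rt', '', 'test', 'elections', '', 'mailin']
print(non_empty)  # ['rt', 'test', 'elections', 'mailin']
```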

Upvotes: 2

Gabio

Reputation: 9494

If you want to remove all non-letter characters, try:

words = ["".join(filter(lambda c: c.isalpha(), word)) for word in words]
print(words)
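One thing to be aware of: str.isalpha() is Unicode-aware, so accented letters are kept, unlike with an ASCII-only set such as A-Za-z. A small sketch with made-up sample words (str.isascii() needs Python 3.7+):

```python
# Hypothetical sample words, including non-ASCII letters and a curly apostrophe.
samples = ["@test", "café!", "naïve,", "13:", "democrats’"]

# isalpha() accepts any Unicode letter, so 'é' and 'ï' survive.
unicode_letters = ["".join(filter(str.isalpha, w)) for w in samples]

# Adding an isascii() check restricts the result to A-Z and a-z.
ascii_letters = ["".join(c for c in w if c.isascii() and c.isalpha()) for w in samples]

print(unicode_letters)  # ['test', 'café', 'naïve', '', 'democrats']
print(ascii_letters)    # ['test', 'caf', 'nave', '', 'democrats']
```

For English tweets the two usually give the same result; the difference only shows up when the text contains non-ASCII letters.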

Upvotes: 3
