Reputation: 4244
I have this line to remove all non-alphanumeric characters except spaces
re.sub(r'\W+', '', s)
However, it still keeps non-English characters.
For example, if I have
re.sub(r'\W+', '', 'This is a sentence, and here are non-english 托利 苏 !!11')
I want to get as output:
> 'This is a sentence and here are non-english 11'
Upvotes: 23
Views: 60147
Reputation: 123
This might not be an answer to this concrete question, but I came across this thread during my research.
I wanted to reach the same objective as the questioner, but I wanted to keep non-English characters such as ä, ü, ß, ...
The way the questioner's code works, spaces will be deleted too.
A simple workaround is the following:
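re.sub(r'[^ \w]+', '', string)
The ^ negates the character class, so everything except what follows it is matched. Here that is \w (every word character, including non-English letters, digits, and the underscore) and the space. The + belongs outside the brackets; inside them it would be a literal plus sign that is kept in the output.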
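To make the behaviour of \w concrete, here is a minimal sketch using the sample string from the question. The variable names are only for illustration; re.ASCII is the standard flag in Python 3's re module that restricts \w to [a-zA-Z0-9_].

```python
import re

s = 'This is a sentence, and here are non-english 托利 苏 !!11'

# In a str pattern, \w matches Unicode word characters by default,
# so the Chinese characters survive.
unicode_kept = re.sub(r'[^ \w]+', '', s)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so they are removed.
ascii_only = re.sub(r'[^ \w]+', '', s, flags=re.ASCII)

print(unicode_kept)
print(ascii_only)
```

Note that in both cases the underscore would be kept, because it counts as a word character.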
I hope this will help someone in the future
Upvotes: 10
Reputation: 5011
I once had this exact problem; the only difference was that I wasn't able to import anything or use regex.
To solve my problem I created a list containing all of the values I wanted to keep:
values = list("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")
Then I created a function that would loop through each item in the string and if it wasn't in the values list, it'd remove (replace) it from the string:
def remover(my_string = ""):
    for item in my_string:
        if item not in values:
            my_string = my_string.replace(item, "")
    return my_string
For example, the following code:
print(remover("H!e£l$l%o^ W&o*r(l)d!:)"))
Should output:
Hello World
Sure, this isn't the best way to do this, but given the circumstances it was a quick and easy way to get the job done.
NOTE: you can instead remove the items that are in the values list by changing if item not in values to if item in values.
NOTE: I wasn't allowed to use string constants because the string module has to be imported to use them.
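A possible variation under the same no-imports constraint, as a sketch only: build the result in one pass with a set and str.join instead of calling replace repeatedly. The names ALLOWED and keep_allowed are made up for this example.

```python
# Hypothetical names; same whitelist as the values list above,
# stored as a set so membership checks are fast.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")

def keep_allowed(text=""):
    # Keep only the whitelisted characters, preserving their order.
    return "".join(ch for ch in text if ch in ALLOWED)

print(keep_allowed("H!e£l$l%o^ W&o*r(l)d!:)"))
```

This avoids rescanning the whole string for every unwanted character, which may matter for long inputs.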
Good luck.
Upvotes: 3
Reputation: 4740
re.sub(r'[^A-Za-z0-9 ]+', '', s)
(Edit) To clarify:
The [] creates a character class (a list of chars). The ^ negates the list. A-Za-z is the English alphabet, 0-9 is the digits, and the trailing space is a literal space. The + matches one or more characters in a row, so any run of characters that is not A-Z, a-z, 0-9, or a space is replaced with the empty string.
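As a quick check, here is the pattern applied to the sentence from the question. This is a sketch; the second substitution that collapses repeated spaces is an optional extra step, not part of the answer above.

```python
import re

s = 'This is a sentence, and here are non-english 托利 苏 !!11'

# Remove every run of characters that is not an ASCII letter, digit, or space.
cleaned = re.sub(r'[^A-Za-z0-9 ]+', '', s)

# Deleting characters leaves their surrounding spaces behind,
# so optionally collapse runs of spaces into one.
collapsed = re.sub(r' +', ' ', cleaned).strip()

print(collapsed)  # This is a sentence and here are nonenglish 11
```

The hyphen in non-english is removed as well, since it is not in the character class; add it to the class (for example `[^A-Za-z0-9 -]+`) if it should be kept.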
Upvotes: 52