Reputation: 7
I would like to know how to remove every character that a regular expression matches, except for one chosen character.
I need to clean data. An example of the data is
`some-really,dirty.data%#$_.`
which I would like to look like
`some-reallydirtydata_`
Note the `-` between `some` and `really`. That is my chosen character, which I do not want to remove.
Here is a snippet of my code:
import re
unclean_string = "some-really,dirty.data%#$_."
clean_string = re.sub(r'\W', '', unclean_string)
print(clean_string)
# Output: somereallydirtydata_
I know that `\W` matches everything except "0 to 9, a to z, A to Z, and underscore", so the substitution removes all other characters.
I want to know how I can keep all of those, plus a chosen character (such as `-`).
Disclaimer: I apologise in advance for asking such a basic question. I am new to Python and using regex.
Upvotes: 0
Views: 79
Reputation: 39365
Include the hyphen in a negated character class in your regex:
clean_string = re.sub(r'[^-\w]', '', unclean_string)
Explanation of the regex:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
[^-\w]                   any character except: '-', word characters
                         (a-z, A-Z, 0-9, _)
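As a quick check of the pattern above, here is a minimal sketch run against the sample string from the question. It assumes Python 3 syntax; the variable names are the ones used in the question.

```python
import re

unclean_string = "some-really,dirty.data%#$_."

# [^-\w] matches any single character that is neither '-' nor a word character,
# so re.sub deletes everything except letters, digits, '_' and '-'.
clean_string = re.sub(r'[^-\w]', '', unclean_string)
print(clean_string)  # some-reallydirtydata_
```

Placing the hyphen first (or last) inside the brackets keeps it from being read as a range operator.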
Upvotes: 0
Reputation: 71538
You can use:
clean_string = re.sub(r'[^\w-]', '', unclean_string)
`[^\w]` is the equivalent of `\W`. So, if you add a `-` in there, you will not match it either.
Note: I also made the regex a raw string (the `r` prefix) above because it is good practice. It prevents unexpected behaviour from Python interpreting backslash escapes before the regex engine sees them.
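If you later need to keep more than one extra character, the same idea can be wrapped in a small helper. This is only a sketch: `strip_except` and its `keep` parameter are hypothetical names, not part of the `re` module, and it assumes Python 3.

```python
import re

def strip_except(text, keep=""):
    """Remove every character that is not a word character or listed in `keep`."""
    # re.escape turns characters such as '-', ']' or '^' into literals,
    # so they are safe to place inside the character class.
    pattern = r'[^\w' + re.escape(keep) + r']'
    return re.sub(pattern, '', text)

unclean_string = "some-really,dirty.data%#$_."
print(strip_except(unclean_string, keep="-"))   # some-reallydirtydata_
print(strip_except(unclean_string, keep="-."))  # some-reallydirty.data_.
```

In Python 3, `\w` in a `str` pattern also matches non-ASCII letters and digits; passing `flags=re.ASCII` to `re.sub` restricts it to `[a-zA-Z0-9_]` if that is what the data needs.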
Upvotes: 2