Chrickey

Reputation: 21

Removing special characters and symbols from a string in python

I am trying to do what my title says. I have a list of about 30 thousand business addresses, and I'm trying to make each address as uniform as possible.

As far as removing weird symbols and characters goes, I have found three suggestions, but I don't understand how they are different.

If somebody can explain the difference, or provide insight into a better way to standardize address information, please and thank you!

address = re.sub(r'([^\s\w]|_)+', '', address)

address = re.sub('[^a-zA-Z0-9-_*.]', '', address)

address = re.sub(r'[^\w]', ' ', address)

Upvotes: 2

Views: 9560

Answers (3)

tripleee

Reputation: 189830

The negated class [^a-zA-Z0-9-_*.] enumerates exactly the characters to keep; everything outside those ranges is removed (though the literal - should be at the beginning or end of the character class, where it cannot be mistaken for a range operator).

\w is defined as "word character", which in traditional ASCII locales covers A-Z and a-z as well as digits and underscore, but with Unicode support it also matches accented characters, Cyrillic, Japanese ideographs, etc.

\s matches space characters, which again with Unicode includes a number of extended characters such as the non-breakable space, numeric space, etc.

Which exactly to choose obviously depends on what you want to accomplish and what you mean by "special characters". Numbers are "symbols", all characters are "special", etc.
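To make the Unicode point concrete, here is a minimal sketch using a made-up sample string (the street name and the non-breaking space are assumptions chosen for illustration, not data from the question). It shows which characters \w and \s pick up for an ordinary str pattern in Python 3.

```python
import re

# Hypothetical sample; \u00a0 is a non-breaking space.
sample = "Café Ünïted\u00a0Straße 12_b"

# With str patterns, matching is Unicode-aware by default:
# accented letters count as \w, and the non-breaking space counts as \s.
words = re.findall(r"\w+", sample)
spaces = re.findall(r"\s", sample)

print(words)
print(spaces)
```

Note that the underscore in 12_b stays inside the last word, since \w includes it.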

Here's a pertinent quotation from the Python re documentation:

\s

For Unicode (str) patterns:

Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).

For 8-bit (bytes) patterns:

Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

\w

For Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).

For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].
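As a rough illustration of the quoted passage, the sketch below runs the same substitution three ways: Unicode-aware (the default), with the re.ASCII flag, and with the explicit ASCII classes spelled out. The sample text is invented for the example.

```python
import re

# Hypothetical sample: an accented letter and a non-breaking space.
text = "Zoë\u00a0Ave"

# Default Unicode matching: ë is a word character and \u00a0 is whitespace,
# so nothing is removed.
unicode_result = re.sub(r"[^\w\s]", "", text)

# With re.ASCII, \w and \s shrink to their ASCII definitions,
# so both ë and the non-breaking space are removed.
ascii_result = re.sub(r"[^\w\s]", "", text, flags=re.ASCII)

# Spelling the ASCII sets out explicitly gives the same result without a flag.
explicit_result = re.sub(r"[^a-zA-Z0-9_ \t\n\r\f\v]", "", text)

print(repr(unicode_result), repr(ascii_result), repr(explicit_result))
```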

Upvotes: 1

Christopher Apple

Reputation: 401

You can read a call to re.sub like this (the re module docs have more):

re.sub(a, b, my_string)  # replace any matches of regex a with b in my_string

I would go with the second one. Regexes can be tricky, but this one says:

[^a-zA-Z0-9-_*.]   # anything that's NOT a-z, A-Z, 0-9, -, _, * or .

Which seems like it's what you want. Whenever I'm using regexes, I use this site:

http://regexr.com/

You can put in some of your inputs, and make sure they are matching the right kinds of things before throwing them in your code!
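The same kind of check can be done directly in Python. The sketch below is only an illustration: the sample addresses are invented, and the second pattern is a variant with the hyphen moved to the end of the class, which should match the same set of characters as the pattern from the question.

```python
import re

# Pattern exactly as written in the question.
original = re.compile('[^a-zA-Z0-9-_*.]')
# Variant with the literal hyphen at the end of the class (assumed equivalent).
hyphen_last = re.compile(r'[^a-zA-Z0-9_*.-]')

# Made-up addresses for testing; substitute real inputs here.
samples = [
    "123 Main St., Apt #4",
    "P.O. Box 77 (rear)",
    "O'Brien & Sons - Unit_5",
]

cleaned = [original.sub('', s) for s in samples]
cleaned_variant = [hyphen_last.sub('', s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

One thing this makes visible: a space is not in the list of allowed characters, so this pattern strips the spaces between words as well.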

Upvotes: 0

Pedro Castilho

Reputation: 10532

The first suggestion uses the \s and \w shorthand character classes.

\s means "match any whitespace". \w means "match any letter, number or underscore".

These are used inside a negated character class ([^\s\w]), which means "match anything which isn't whitespace or a word character". It is then combined, using the alternation |, with _, which matches a literal underscore, and the whole group is given a + quantifier, which matches one or more times.

So what this says is: "Match any sequence of one or more characters which are underscores, or which aren't whitespace, letters or numbers, and remove it".

The second option says: "Match any character which isn't an ASCII letter, digit, hyphen, underscore, dot or asterisk and remove it". Note that this removes spaces too. This is stated by that big negated character class (the stuff between the brackets).

The third option says "Take anything which is not a letter, number or underscore and replace it by a space". It uses the \w shorthand, which I have explained.

All of the options use Regular Expressions in order to match character sequences with certain characteristics, and the re.sub function, which sub-stitutes anything matched by the given regex by the second string argument.
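To see the three options side by side, here is a small sketch that applies each one to the same input. The address is a made-up example chosen to contain an underscore, an apostrophe, a dot, a hash and a hyphen.

```python
import re

# Hypothetical address, not taken from the question's data.
address = "12_B O'Neil Rd. #5-A"

# Option 1: delete runs of punctuation and underscores; keep whitespace.
opt1 = re.sub(r'([^\s\w]|_)+', '', address)

# Option 2: delete everything except ASCII letters, digits, - _ * and .
# (spaces are deleted as well).
opt2 = re.sub('[^a-zA-Z0-9-_*.]', '', address)

# Option 3: replace every non-word character, one for one, with a space.
opt3 = re.sub(r'[^\w]', ' ', address)

print(repr(opt1))
print(repr(opt2))
print(repr(opt3))
```

With this input, option 1 gives '12B ONeil Rd 5A', option 2 gives '12_BONeilRd.5-A', and option 3 gives '12_B O Neil Rd   5 A' (three spaces after Rd, one each for the dot, the original space and the hash).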

You can read more about Regular Expressions in Python in the documentation for the re module.

Upvotes: 1
