anish

Reputation: 1033

Need help with the following regular expression

I am using Python's re module for regular expressions. I want to remove every character from a string except digits and letters. To achieve this I am using the sub function.

Code snippet:

>>> import re
>>> text = "foo.bar"
>>> re.sub("[^A-Z][^a-z]", "", text)
'fobar'

Why does the above expression remove the "o."?

I cannot understand why it removes the "o" along with the dot. Can someone please explain what is going on here?

I know the correct solution to this problem is

>>> re.sub("[^A-Z ^a-z]", "", text)
'foobar'

Thanks in advance

Upvotes: 1

Views: 111

Answers (3)

ThomasH

Reputation: 23516

A very important point to realize is that [^A-Z][^a-z] matches two consecutive characters (one for each character class), while [^A-Za-z] matches only one.
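To make the difference concrete, here is a minimal sketch that runs both patterns over the string from the question. The variable names are only illustrative.

```python
import re

text = "foo.bar"

# Two classes in a row: one non-uppercase character
# immediately followed by one non-lowercase character.
two_chars = re.findall(r"[^A-Z][^a-z]", text)

# A single class: one character that is neither an
# uppercase nor a lowercase ASCII letter.
one_char = re.findall(r"[^A-Za-z]", text)

print(two_chars)  # ['o.']
print(one_char)   # ['.']
```

Each match of the first pattern is two characters long, which is why re.sub deletes the "o" together with the dot.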

Upvotes: 3

Stofke

Reputation: 2958

Explained in detail:

[^A-Z] means any character except uppercase A to Z. The second o in foo.bar is not uppercase, so it matches; as a matter of fact, every character in foo.bar matches at this point.

Then you add [^a-z], so you also look for a character that is not lowercase; only the dot matches.

Combine both and you are looking for a non-uppercase character followed by a non-lowercase character, so the pattern matches "o.".
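The three steps above can be checked one at a time. This is just a sketch for the foo.bar example; the list names are made up for illustration, and re.fullmatch assumes Python 3.4 or later.

```python
import re

text = "foo.bar"

# Step 1: which single characters satisfy [^A-Z]? (all of them here)
not_upper = [c for c in text if re.fullmatch(r"[^A-Z]", c)]

# Step 2: which single characters satisfy [^a-z]? (only the dot)
not_lower = [c for c in text if re.fullmatch(r"[^a-z]", c)]

# Step 3: the combined pattern needs both, side by side.
m = re.search(r"[^A-Z][^a-z]", text)

print(not_upper)            # ['f', 'o', 'o', '.', 'b', 'a', 'r']
print(not_lower)            # ['.']
print(m.group(), m.span())  # o. (2, 4)
```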

The solution is the one proposed by Ignacio.

Upvotes: 2

Ignacio Vazquez-Abrams

Reputation: 798626

Because o matches [^A-Z] and . matches [^a-z].

And the correct solution is [^A-Za-z0-9].
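A short sketch of that pattern in use follows. The helper name keep_alnum and the second test string are assumptions added for illustration. The last line shows that the [^A-Z ^a-z] pattern from the question only appears to work on foo.bar: inside a character class, the space and the second ^ are literal characters, so they are kept, and digits are removed.

```python
import re

def keep_alnum(text):
    """Remove everything except ASCII letters and digits (hypothetical helper)."""
    return re.sub(r"[^A-Za-z0-9]", "", text)

print(keep_alnum("foo.bar"))   # foobar
print(keep_alnum("a^b c-1"))   # abc1

# The pattern from the question keeps '^' and ' ' and drops the digit.
print(re.sub(r"[^A-Z ^a-z]", "", "a^b c-1"))  # a^b c
```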

Upvotes: 1
