Spark_TheCat

Reputation: 99

How to remove every non-alphabetic character in Python 3

I am coding the Caesar cipher in Python 3, and I have reached the point where I have to get rid of special characters in the cipher part. My current solution mostly works, but unwanted characters pass through:

chain = "abcàéÉç"
listOfChain = list(chain)
for element in listOfChain:
    if element.isalpha():
        print(element)

The code above should only print abc, but àéÉç passes through as well. I only want A-Z and a-z, without éèêëç and so on. How can I check whether these characters are in the list?
So far isalpha() lets them pass. Is there another way to do this?

Upvotes: 4

Views: 4311

Answers (3)

Don O'Donnell

Reputation: 4728

According to the Python 3.3 docs:

str.isalpha() Return true if all characters in the string are alphabetic and there is at least one character, false otherwise. Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”. Note that this is different from the “Alphabetic” property defined in the Unicode Standard.

So isalpha() accepts accented and other non-English letters as well as the ASCII letters you want.

The easiest way to isolate these may be to import string.ascii_letters, which is a string of all lowercase and uppercase ASCII letters, then:
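
To see this on the string from the question, here is a minimal sketch using the standard unicodedata module. The variable names are only illustrative.

```python
import unicodedata

chain = "abcàéÉç"

# Map each character to its Unicode general category. "Ll" is a lowercase
# letter and "Lu" an uppercase letter; isalpha() accepts both.
categories = {ch: unicodedata.category(ch) for ch in chain}

# Every character in the sample, accented or not, counts as alphabetic.
all_alpha = all(ch.isalpha() for ch in chain)

print(categories)
print(all_alpha)
```

The accented characters report the same letter categories as plain a, b, c, which is why isalpha() cannot tell them apart.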

>>> from string import ascii_letters
>>> chars = "abcàéÉç"
>>> for element in chars:
...     if element in ascii_letters:
...         print(element)
...
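
An alternative that keeps the isalpha() check from the question, sketched here on the assumption that Python 3.7 or later is available (str.isascii() does not exist in earlier versions):

```python
chain = "abcàéÉç"

# isascii() rejects anything outside the ASCII range, and isalpha() then
# rejects ASCII digits and punctuation, leaving only A-Z and a-z.
ascii_only = [ch for ch in chain if ch.isascii() and ch.isalpha()]

print(ascii_only)
```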

Upvotes: 4

Maxime Lorant

Reputation: 36181

With Python 3, you can use the string string.ascii_letters, which contains every ASCII alphabetic character.

>>> import string
>>> chain = 'abcàéÉç'
>>> listOfChain = [x for x in chain if x in string.ascii_letters]
>>> listOfChain
['a', 'b', 'c']

Compared to the regex solution of @hkpeprah, it's more efficient:

>>> import timeit
# Regex solution
>>> timeit.timeit('[l for l in chain if re.search("[^a-zA-Z]", l) == None]', setup='chain="abcàéÉç"; import re', number=100000)
6.374363899230957
# string contains solution
>>> timeit.timeit("[x for x in chain if x in string.ascii_letters]", setup="chain='abcàéÉç'; import string;", number=100000)
0.24501395225524902

Upvotes: 1

Ford

Reputation: 2597

You can use the re module:

>>> import re
>>> re.search("[^a-zA-Z]", "abcdef")
>>> re.search("[^a-zA-Z]", "abcdef2")
<_sre.SRE_Match object at 0x10ddb78b8>
>>> re.search("[^a-zA-Z]", "abcàéÉç")
<_sre.SRE_Match object at 0x10ddb7850>

This then makes your if statement:

if re.search("[^a-zA-Z]", element) is None:
    print(element)

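Note: If you want to allow digits as well, you can replace [^a-zA-Z] with [^\w] or, even simpler, \W. Be aware that \w also matches the underscore, and that in Python 3 it matches accented letters on str input unless you pass the re.ASCII flag.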
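
A short sketch of how \w behaves on str input with and without re.ASCII. The sample string is an assumption: it extends the one from the question with an underscore and a digit.

```python
import re

chain = "abcàéÉç_1"

# Default (Unicode-aware) matching: accented letters count as word
# characters, so nothing in the sample is filtered out.
unicode_kept = [ch for ch in chain if re.search(r"\W", ch) is None]

# With re.ASCII, \w is restricted to [a-zA-Z0-9_].
ascii_kept = [ch for ch in chain if re.search(r"\W", ch, flags=re.ASCII) is None]

print(unicode_kept)
print(ascii_kept)
```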

Edit: For simplicity you can even do:

chain = "abcàéÉç"
listOfChain = list(chain)
listOfChain = [l for l in listOfChain if re.search("[^a-zA-Z]", l) is None]
print("\n".join(listOfChain))
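
If a string is wanted rather than a list, one possible variation is to let re.sub drop every non-ASCII-letter in a single pass instead of testing each character separately. The name letters_only is only illustrative.

```python
import re

chain = "abcàéÉç"

# Replace every character that is not A-Z or a-z with the empty string.
letters_only = re.sub(r"[^a-zA-Z]", "", chain)

print(letters_only)
```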

Upvotes: 0
