JavaSa

Reputation: 6241

Check that a string contains only ASCII characters?

How do I check that a string only contains ASCII characters in Python? Something like Ruby's ascii_only?

I want to be able to tell whether specific string data read from a file is ASCII.

Upvotes: 16

Views: 48063

Answers (4)

warvariuc

Reputation: 59604

Python 3.7 added methods that do what you want:

str, bytes, and bytearray gained support for the new isascii() method, which can be used to test if a string or bytes contain only the ASCII characters.
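
As a quick illustration, assuming Python 3.7 or later, the method can be called on text and on bytes alike. The sample values below are made up for the demonstration:

```python
# Requires Python 3.7+, where isascii() was added to str, bytes, and bytearray.
samples = ['string', 'строка', b'raw bytes', 'строка'.encode('utf-8'), '']

# isascii() returns True for empty input as well as for all-ASCII input.
results = [value.isascii() for value in samples]

for value, ok in zip(samples, results):
    print(repr(value), ok)
```

Note that an empty string counts as ASCII here, which differs from checks that require at least one character.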


Otherwise:

>>> all(ord(char) < 128 for char in 'string')
True
>>> all(ord(char) < 128 for char in 'строка')
False

Another version, for Python 2 (it relies on the unicode type, which Python 3 does not have):

>>> def is_ascii(text):
...     if isinstance(text, unicode):
...         try:
...             text.encode('ascii')
...         except UnicodeEncodeError:
...             return False
...     else:
...         try:
...             text.decode('ascii')
...         except UnicodeDecodeError:
...             return False
...     return True
...
>>> is_ascii('text')
True
>>> is_ascii(u'text')
True
>>> is_ascii(u'text-строка')
False
>>> is_ascii('text-строка')
False
>>> is_ascii(u'text-строка'.encode('utf-8'))
False

Upvotes: 36

rotten

Reputation: 1630

If you have Unicode strings, you can call the encode() method and catch the exception:

try:
    mynewstring = mystring.encode('ascii')
except UnicodeEncodeError:
    print("there are non-ascii characters in there")

If you have bytes, you can use the third-party chardet module to detect the encoding:

import chardet

# Get the detected encoding; compare it with 'ascii' to test for pure-ASCII input
enc = chardet.detect(mystring)['encoding']
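
chardet reports a best guess together with a confidence value, so for the narrower question of whether bytes are pure ASCII, a strict decode from the standard library may be enough. A minimal sketch, where first_non_ascii is a hypothetical helper name and the data is assumed to have been read in binary mode:

```python
def first_non_ascii(data):
    """Return the offset of the first non-ASCII byte, or None if there is none.

    Hypothetical helper for illustration; it is not part of chardet.
    """
    try:
        data.decode('ascii')
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded.
        return exc.start
    return None


print(first_non_ascii(b'plain text'))
print(first_non_ascii('café'.encode('utf-8')))
```

Returning the offset rather than a bare boolean makes it easier to report where in a file the non-ASCII data begins.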

Upvotes: 6

Quinn

Reputation: 4504

You can also use a regular expression to check for ASCII-only content. The character class [\x00-\x7F] matches a single ASCII character:

>>> import re
>>> OnlyAscii = lambda s: re.match('^[\x00-\x7F]+$', s) is not None
>>> OnlyAscii('string')
True
>>> OnlyAscii('Tannhäuser')
False
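
The same character class can be negated to list which characters are the problem, which a plain True/False check does not reveal. A sketch, with NON_ASCII and non_ascii_chars as hypothetical names:

```python
import re

# Negated class: matches any single character outside the ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]')


def non_ascii_chars(text):
    """Return every non-ASCII character in text, in order of appearance."""
    return NON_ASCII.findall(text)


print(non_ascii_chars('string'))            # []
print(non_ascii_chars('Tannh\u00e4user'))   # ['ä']
```

An empty result means the string is ASCII only; unlike the + quantifier in the pattern above, this also treats the empty string as ASCII.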

Upvotes: 6

Girish Jadhav

Reputation: 194

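A workaround would be to try to encode the string and catch the resulting error. This relies on Python 2, where calling encode() on a byte string first decodes it implicitly with the ASCII codec.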

For example:

'H€llø'.encode('utf-8')

This will throw the following error:

Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1: ordinal not in range(128)

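Now you can catch the UnicodeDecodeError to determine that the string contains more than just ASCII characters. In Python 3 the same call succeeds without an error, so there you would encode to 'ascii' and catch UnicodeEncodeError instead.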

try:
    'H€llø'.encode('utf-8')
except UnicodeDecodeError:
    print 'This string contains more than just the ASCII characters.'

Upvotes: 0
