Using Python to select only the non-ASCII characters in a string

Question

If I have a string called mystring that contains Ümeå, I would like to store the non-ASCII characters Üå in a list.

Below is my code. It is almost working, but the list contains hex escapes (i.e. \xc3\xa6) rather than the correctly encoded characters:

try:
   mystring.iloc[i].decode('ascii')
   i+=1
except:
   nonascii_string = str(mystring.iloc[i])
   j=0
   #now we've found the string, isolate the non ascii characters
   for item in str(profile_data_nonascii_string):
      try:
         str(nonascii_string[j].decode('ascii'))
         j+=1
      except:
         # PROBLEM: Need to work out how to encode back to proper UTF8 values
         nonascii_chars_list.append(str(nonascii_string[j]))
         j+=1
      i+=1
      pass

I think I need to do something like:

chr(profile_data_nonascii_string[j].encode('utf-8'))

but of course doing that only selects the first byte of my multibyte character (and hence throws an error). I am sure there is a simple solution... :-|

Padraic Cunningham · Accepted Answer

You can create a mapping of the chars you want to remove and use str.translate to delete them from the string:

In [29]: tbl = dict.fromkeys(range(128), u"")

In [30]: s = u'Ümeå'

In [31]: print(s.translate(tbl))
Üå
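Since the question asks for a list rather than a string, here is a minimal sketch of the same idea under Python 3. The helper name non_ascii_chars and the constant ASCII_DELETE_TABLE are made up for illustration; the table maps each ASCII code point to None, which str.translate treats as "delete this character".

```python
# Map every ASCII code point (0-127) to None so str.translate deletes it.
# ASCII_DELETE_TABLE and non_ascii_chars are illustrative names, not from the answer.
ASCII_DELETE_TABLE = dict.fromkeys(range(128))


def non_ascii_chars(text):
    """Return the non-ASCII characters of text, in order, as a list."""
    return list(text.translate(ASCII_DELETE_TABLE))


print(non_ascii_chars("Ümeå"))  # ['Ü', 'å']
```

Calling list() on the translated string splits it into one element per character, so multibyte characters stay intact instead of being broken into bytes.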

In pandas, which you seem to be using, you can use pandas.Series.str.translate:

Series.str.translate(table, deletechars=None)

Map all characters in the string through the given mapping table. Equivalent to standard str.translate(). Note that the optional argument deletechars is only valid if you are using python 2. For python 3, character deletion should be specified via the table argument.

translate is going to be more efficient than filtering the characters and rebuilding the string with str.join:

In [7]: s = 'Ümeå' * 1000

In [8]: timeit ''.join([x for x in s if ord(x) > 127])
1000 loops, best of 3: 489 µs per loop

In [9]: timeit s.translate(tbl)
1000 loops, best of 3: 289 µs per loop

In [10]: s.translate(tbl) ==  ''.join([x for x in s if ord(x) > 127])
Out[10]: True
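Outside IPython, the comparison above could be reproduced with the standard timeit module. This is only a sketch for Python 3: the function names with_translate and with_join are made up here, and the timings will vary by machine and Python version, so no expected numbers are shown.

```python
import timeit

# Same table and test string as in the transcript above (Python 3 assumed).
tbl = dict.fromkeys(range(128))
s = "Ümeå" * 1000


def with_translate():
    # Delete ASCII characters via the translation table.
    return s.translate(tbl)


def with_join():
    # Keep characters whose code point is above the ASCII range.
    return "".join([ch for ch in s if ord(ch) > 127])


# Both approaches should produce the same string before we compare speed.
assert with_translate() == with_join()

for fn in (with_translate, with_join):
    seconds = timeit.timeit(fn, number=1000)
    print(f"{fn.__name__}: {seconds:.3f}s for 1000 calls")
```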

For pandas with Python 2, you need to pass None as the table and the ASCII characters as deletechars:

In [2]: import pandas as pd

In [3]: raw_data = {'Name' : pd.Series(['david','åndrëw','calvin'], index=['a', 'b', 'c'])}

In [4]: df = pd.DataFrame(raw_data, columns = ['Name'])

In [5]: delete = "".join(map(chr,range(128)))

In [6]: print df['Name'].str.translate(None, delete)
a      
b    åë
c      
Name: Name, dtype: object

For python3 using the dict works fine:

In [9]: import pandas as pd

In [10]: raw_data = {'Name' : pd.Series(['david','åndrëw','calvin'], index=['a', 'b', 'c'])}

In [11]: df = pd.DataFrame(raw_data, columns = ['Name'])

In [12]: delete = dict.fromkeys(range(128), "")

In [13]: df['Name'].str.translate(delete)
Out[13]: 
a      
b    åë
c      
Name: Name, dtype: object

The different approaches needed are documented:

Parameters:

table : dict (python 3), str or None (python 2)
    In python 3, table is a mapping of Unicode ordinals to Unicode ordinals, strings, or None. Unmapped characters are left untouched. Characters mapped to None are deleted. str.maketrans() is a helper function for making translation tables. In python 2, table is either a string of length 256 or None. If the table argument is None, no translation is applied and the operation simply removes the characters in deletechars. string.maketrans() is a helper function for making translation tables.

deletechars : str, optional (python 2)
    A string of characters to delete. This argument is only valid in python 2.
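The quoted documentation mentions str.maketrans() as a helper for building the table. A minimal Python 3 sketch, using plain strings instead of a pandas Series so it runs without extra dependencies: the third argument of str.maketrans lists the characters to delete, which yields the same kind of ordinal-to-None mapping as the dict built by hand. The variable names are illustrative.

```python
# Build a deletion table with str.maketrans (Python 3).
# The third argument lists characters that should be mapped to None (deleted).
ascii_chars = "".join(map(chr, range(128)))
delete_table = str.maketrans("", "", ascii_chars)

names = ["david", "åndrëw", "calvin"]
stripped = [name.translate(delete_table) for name in names]
print(stripped)  # ['', 'åë', '']
```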
