Reputation: 345
If I have a string called mystring that contains Ümeå, I would like to store its non-ASCII characters, Ü and å, in a list.
Below is my code. It almost works, but the list contains escaped bytes (such as \xc3\xa6) rather than the correctly encoded characters:
try:
    mystring.iloc[i].decode('ascii')
    i+=1
except:
    nonascii_string = str(mystring.iloc[i])
    j=0
    #now we've found the string, isolate the non ascii characters
    for item in str(profile_data_nonascii_string):
        try:
            str(nonascii_string[j].decode('ascii'))
            j+=1
        except:
            # PROBLEM: Need to work out how to encode back to proper UTF8 values
            nonascii_chars_list.append(str(nonascii_string[j]))
            j+=1
    i+=1
    pass
I think I need to do something like:
chr(profile_data_nonascii_string[j].encode('utf-8'))
but of course doing that only selects the first byte of my multibyte character (and hence throws an error). I am sure there is a simple solution... :-|
Upvotes: 0
Views: 1148
Reputation: 180522
You can create a mapping of the characters you want to remove and use str.translate to delete them from the string:
In [29]: tbl = dict.fromkeys(range(128), u"")
In [30]: s = u'Ümeå'
In [31]: print(s.translate(tbl))
Üå
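Since the question asks for a list rather than a string, the same table can feed a small helper. This is only a Python 3 sketch: the name non_ascii_chars is made up for illustration, and it assumes the input is already a decoded (Unicode) string.

```python
# Python 3 sketch; non_ascii_chars is a hypothetical helper name.
# Map every ASCII code point (0-127) to the empty string so translate drops it.
ASCII_DELETE_TABLE = dict.fromkeys(range(128), "")

def non_ascii_chars(text):
    """Return the non-ASCII characters of text, in order, as a list."""
    return list(text.translate(ASCII_DELETE_TABLE))

print(non_ascii_chars("Ümeå"))  # ['Ü', 'å']
```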
In pandas, which you seem to be using, you can use pandas.Series.str.translate:
Series.str.translate(table, deletechars=None)
Map all characters in the string through the given mapping table. Equivalent to standard str.translate(). Note that the optional argument deletechars is only valid if you are using python 2. For python 3, character deletion should be specified via the table argument.
translate is going to be more efficient than str.join:
In [7]: s = 'Ümeå' * 1000
In [8]: timeit ''.join([x for x in s if ord(x) > 127])
1000 loops, best of 3: 489 µs per loop
In [9]: timeit s.translate(tbl)
1000 loops, best of 3: 289 µs per loop
In [10]: s.translate(tbl) == ''.join([x for x in s if ord(x) > 127])
Out[10]: True
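Outside IPython, roughly the same comparison can be run with the standard timeit module. This is a sketch for Python 3; the function names and the repeat count of 200 are arbitrary choices, and the absolute numbers will vary by machine and Python version.

```python
import timeit

tbl = dict.fromkeys(range(128), "")
s = "Ümeå" * 1000

def with_join():
    # Keep only code points above the ASCII range.
    return "".join([ch for ch in s if ord(ch) > 127])

def with_translate():
    # Delete every ASCII code point via the translation table.
    return s.translate(tbl)

# Total seconds for 200 calls of each approach.
join_seconds = timeit.timeit(with_join, number=200)
translate_seconds = timeit.timeit(with_translate, number=200)

print("join:      %.4f s" % join_seconds)
print("translate: %.4f s" % translate_seconds)
```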
For pandas with Python 2, you need to pass None as the table and the characters to remove as deletechars:
In [2]: import pandas as pd
In [3]: raw_data = {'Name' : pd.Series(['david','åndrëw','calvin'], index=['a', 'b', 'c'])}
In [4]: df = pd.DataFrame(raw_data, columns = ['Name'])
In [5]: delete = "".join(map(chr,range(128)))
In [6]: print df['Name'].str.translate(None, delete)
a
b åë
c
Name: Name, dtype: object
For Python 3, using the dict works fine:
In [9]: import pandas as pd
In [10]: raw_data = {'Name' : pd.Series(['david','åndrëw','calvin'], index=['a', 'b', 'c'])}
In [11]: df = pd.DataFrame(raw_data, columns = ['Name'])
In [12]: delete = dict.fromkeys(range(128), "")
In [13]: df['Name'].str.translate(delete)
Out[13]:
a
b åë
c
Name: Name, dtype: object
The different approaches needed are documented:
Parameters:
table : dict (python 3), str or None (python 2)
    In python 3, table is a mapping of Unicode ordinals to Unicode ordinals, strings, or None. Unmapped characters are left untouched. Characters mapped to None are deleted. str.maketrans() is a helper function for making translation tables. In python 2, table is either a string of length 256 or None. If the table argument is None, no translation is applied and the operation simply removes the characters in deletechars. string.maketrans() is a helper function for making translation tables.
deletechars : str, optional (python 2)
    A string of characters to delete. This argument is only valid in python 2.
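The str.maketrans() helper mentioned in that documentation can build the Python 3 table directly. A minimal sketch using plain strings rather than a Series: the third argument lists the characters to delete, and each of them is mapped to None.

```python
# Python 3 only: the third argument of str.maketrans lists characters to delete.
ascii_chars = "".join(map(chr, range(128)))
delete_ascii = str.maketrans("", "", ascii_chars)

# Every ASCII ordinal is mapped to None, which translate treats as "delete".
print(delete_ascii[ord("a")])            # None
print("åndrëw".translate(delete_ascii))  # åë
```

A table built this way should be usable wherever the dict.fromkeys table is, since both delete the same 128 code points.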
Upvotes: 0
Reputation: 168866
Here is how I separated the non-ASCII chars from your example string:
In [7]: s=u'Ümeå'
In [8]: print s
Ümeå
In [9]: s2 = u''.join(x for x in s if ord(x) > 127)
In [10]: print s2
Üå
Or, if you prefer your answers in a list:
In [15]: s=u'Ümeå'
In [16]: print s
Ümeå
In [17]: s2 = list(x for x in s if ord(x) > 127)
In [18]: print s2[0]
Ü
In [19]: print s2[1]
å
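The \xc3-style values in the question are what you get when UTF-8 encoded bytes are examined one byte at a time, because each non-ASCII character there occupies two bytes. The following Python 3 sketch assumes the data arrives as UTF-8 bytes (the raw variable is a stand-in for that) and decodes it before filtering, so the filter sees whole characters:

```python
# Python 3 sketch; raw stands in for UTF-8 bytes read from a file or socket.
raw = "Ümeå".encode("utf-8")

# Filtering the bytes directly splits each multibyte character in two.
high_bytes = [b for b in raw if b > 127]
print(len(high_bytes))  # 4

# Decode first, then filter by code point to get whole characters.
text = raw.decode("utf-8")
non_ascii = [ch for ch in text if ord(ch) > 127]
print(non_ascii)  # ['Ü', 'å']
```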
Upvotes: 2