Reputation: 2797
I have a column with addresses and want to find all rows that contain 'foreign', i.e. non-ASCII, characters.
import pandas as pd
df = pd.DataFrame.from_dict({
    'column_name': ["GREENLAND HOTEL, CENTRAL AVENUE, NAGPUR-440 018.",
                    "Møllegade 1234567 DK-6400 Sønderborg Denmark"],
    'column_other': ["0", "1"],
})
The expected output is only the second row, which contains the "ø" character.
I tried this:
df['column_name'].str.isascii()
but this does not work, at least in Python 3.6 (str.isascii was only added in Python 3.7, and pandas has no .str.isascii accessor method).
In MySQL, this equivalent query works:
SELECT * FROM `table_name` WHERE `column_name` != CONVERT(`column_name` USING ASCII)
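A rough pandas analogue of that SQL filter (just a sketch, using the df defined above) is to strip the non-ASCII bytes with encode/decode and compare the result against the original column, mirroring the CONVERT(... USING ASCII) comparison:

```python
import pandas as pd

df = pd.DataFrame({
    'column_name': ["GREENLAND HOTEL, CENTRAL AVENUE, NAGPUR-440 018.",
                    "Møllegade 1234567 DK-6400 Sønderborg Denmark"],
    'column_other': ["0", "1"],
})

# Like CONVERT(`column_name` USING ASCII): drop the non-ASCII characters,
# then keep the rows where that changed the value.
ascii_version = (df['column_name']
                 .str.encode('ascii', errors='ignore')
                 .str.decode('ascii'))
print(df[df['column_name'] != ascii_version])
```

Only the second row (the one containing "ø") survives the filter.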
Upvotes: 3
Views: 3310
Reputation: 19
In this example, the dataframe is named data.
First, make a list of the columns with string dtype:
cols = ["A", "B", "C"]
Run the code below to loop through the columns and count the values in each column that contain non-ASCII characters:
for col in cols:
    print(col, data[col].str.len().ne(data[col].str.encode("ascii", errors="ignore").str.len()).sum())
A 0
B 3
C 0
For this example, column B has the non-ASCII values.
Run the modified code below to get the rows with the non-ASCII values (the expression inside the brackets is a boolean mask, so indexing with it selects the matching rows):
data[data["B"].str.len().ne(data["B"].str.encode("ascii", errors="ignore").str.len())]
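A small self-contained sketch of the approach above, with made-up data (the column names A and B and their values are illustrative, not from the question):

```python
import pandas as pd

data = pd.DataFrame({
    "A": ["foo", "bar"],
    "B": ["plain", "Sønderborg"],  # one non-ASCII value
})

# True where encoding to ASCII (dropping errors) shortened the string,
# i.e. where the value contained non-ASCII characters.
mask = data["B"].str.len().ne(
    data["B"].str.encode("ascii", errors="ignore").str.len()
)
print(data[mask])  # only the row with "Sønderborg"
```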
Upvotes: 1
Reputation: 29635
Not sure about the performance, but you can try to encode the string while ignoring the errors, then compare the length of the encoded string to the original one, like:
mask_nonAscii = df.column_name.map(lambda x: len(x.encode('ascii', errors='ignore')) != len(x))
print (df[mask_nonAscii])
column_name column_other
1 Møllegade 1234567 DK-6400 Sønderborg Denmark 1
EDIT: thanks to a comment by anky_91, you can also use str.len and str.encode from pandas, like:
mask_nonAscii = df['column_name'].str.len()\
.ne(df['column_name'].str.encode('ascii',errors = 'ignore').str.len())
At least the comparison between the lengths is vectorized.
Also, I have not seen any documentation for a pandas.Series.str.isascii method, but if you use Python 3.7+, you can create the mask this way:
mask_nonAscii = ~df.column_name.map(str.isascii)
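Another common pattern (not from the answers above, just an alternative worth knowing) is a regex test for any character outside the 7-bit ASCII range:

```python
import pandas as pd

df = pd.DataFrame({'column_name': [
    "GREENLAND HOTEL, CENTRAL AVENUE, NAGPUR-440 018.",
    "Møllegade 1234567 DK-6400 Sønderborg Denmark",
]})

# True where the string contains any code point above U+007F
mask_nonAscii = df['column_name'].str.contains(r'[^\x00-\x7F]', regex=True)
print(df[mask_nonAscii])
```

This works on any Python version supported by pandas and avoids the encode round-trip.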
Upvotes: 2