Pranab

Reputation: 2797

Filter pandas dataframe rows containing non-ascii values

I have a column with addresses and want to find all rows that contain 'foreign' i.e. non-ASCII characters.

import pandas as pd

df = pd.DataFrame.from_dict({
    'column_name': ["GREENLAND HOTEL, CENTRAL AVENUE, NAGPUR-440 018.", "Møllegade 1234567 DK-6400 Sønderborg Denmark"],
    'column_other': ["0", "1"]
})

The expected output is only the 2nd row, which contains the "ø" character.

I tried this:

df['column_name'].str.isascii()

but this does not work: pandas' .str accessor has no isascii method, and Python's built-in str.isascii requires 3.7+ (I am on 3.6).

In MySQL this equivalent query works:

SELECT * FROM `table_name` WHERE `column_name`!=CONVERT(`column_name` USING ASCII)

Upvotes: 3

Views: 3310

Answers (2)

tejkweku

Reputation: 19

In this example, the dataframe is named data.

  1. First, make a list of the columns with string dtype:

    cols = ["A", "B", "C"]

  2. Loop through those columns to print, for each one, the number of values that contain non-ASCII characters:

    for col in cols:
        print(col, data[col].str.len().ne(data[col].str.encode("ascii", errors="ignore").str.len()).sum())

    A 0
    B 3
    C 0

    For this example, column B has the non-ASCII values.

  3. Run the modified code below to build the boolean mask marking the rows with non-ASCII values:

    data["B"].str.len().ne(data["B"].str.encode("ascii", errors="ignore").str.len())
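A minimal runnable sketch of the steps above, applied to the dataframe from the question rather than the hypothetical data/cols used in this answer:

```python
import pandas as pd

df = pd.DataFrame.from_dict({
    'column_name': ["GREENLAND HOTEL, CENTRAL AVENUE, NAGPUR-440 018.",
                    "Møllegade 1234567 DK-6400 Sønderborg Denmark"],
    'column_other': ["0", "1"]
})

# A value is non-ASCII when encoding with errors="ignore" drops characters,
# making the encoded byte length shorter than the original string length.
mask = df['column_name'].str.len().ne(
    df['column_name'].str.encode('ascii', errors='ignore').str.len()
)

# Use the mask to keep only the offending rows.
print(df[mask])
```

This keeps only the second row, since "ø" is dropped by the ASCII encoding and the lengths no longer match.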

Upvotes: 1

Ben.T

Reputation: 29635

Not sure about the performance, but you can try to encode the string and ignore the errors, then compare the length of the encoded string to that of the original, like:

mask_nonAscii = df.column_name.map(lambda x: len(x.encode('ascii', errors='ignore')) != len(x))
print (df[mask_nonAscii])
                                    column_name column_other
1  Møllegade 1234567 DK-6400 Sønderborg Denmark            1

EDIT: thanks to a comment of anky_91 you can also use str.len and str.encode from pandas like:

mask_nonAscii = df['column_name'].str.len()\
                 .ne(df['column_name'].str.encode('ascii',errors = 'ignore').str.len())

at least the comparison between lengths is vectorized


Also, I have not seen any documentation for a pandas.Series.str.isascii method, but if you use Python 3.7+, you can create the mask with the built-in str.isascii this way:

mask_nonAscii = ~df.column_name.map(str.isascii)
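A self-contained sketch of that last approach, assuming Python 3.7+ and the dataframe from the question:

```python
import pandas as pd

df = pd.DataFrame.from_dict({
    'column_name': ["GREENLAND HOTEL, CENTRAL AVENUE, NAGPUR-440 018.",
                    "Møllegade 1234567 DK-6400 Sønderborg Denmark"],
    'column_other': ["0", "1"]
})

# str.isascii (Python 3.7+) is True only when every character is ASCII;
# invert the mask to keep the rows that contain non-ASCII characters.
mask_nonAscii = ~df['column_name'].map(str.isascii)
print(df[mask_nonAscii])
```

This avoids encoding entirely, at the cost of a Python-level map over the column.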

Upvotes: 2
