Reputation: 21
I am stuck on a problem with a DataFrame whose column of film titles contains a bunch of non-Latin names, such as Japanese or Chinese (and maybe Russian too). My code is:
df['title'].head(5)
1 I am legend
2 wonder women
3 アライヴ
4 怪獣総進撃
5 dead sea
I just want an output that removes every title with non-Latin characters, i.e. drop every row containing characters like those in rows 3 and 4, so my desired output is:
df['title'].head(5)
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
Any help with this code?
Upvotes: 1
Views: 1102
Reputation: 1
We can easily make a function that returns whether a string is ASCII, and filter the DataFrame based on that.
import pandas as pd

dict_1 = {'col1': list(range(1, 6)),
          'col2': ['I am legend', 'wonder women', 'アライヴ', '怪獣総進撃', 'dead sea']}

def check_ascii(string):
    # str.isascii already returns a bool, so we can return it directly
    return string.isascii()

df = pd.DataFrame(dict_1)
df['is_eng'] = df['col2'].apply(check_ascii)
df2 = df[df['is_eng']]
df2
Upvotes: 0
Reputation: 260845
You can use str.match with the Latin character range to identify non-Latin titles, and use the boolean output to slice the data:
df_latin = df[~df['title'].str.match(r'.*[^\x00-\xFF]')]
output:
title
1 I am legend
2 wonder women
5 dead sea
6 the rig
7 altitude
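A minimal, self-contained sketch of this approach, using sample data assumed from the question (the DataFrame name and titles are illustrative):

```python
import pandas as pd

# Hypothetical sample mirroring the question's titles
df = pd.DataFrame({'title': ['I am legend', 'wonder women', 'アライヴ',
                             '怪獣総進撃', 'dead sea']})

# str.match anchors at the start of the string; the pattern matches any
# title containing a character outside the Latin-1 range (\x00-\xFF),
# and ~ negates the mask so those rows are dropped.
df_latin = df[~df['title'].str.match(r'.*[^\x00-\xFF]')]
print(df_latin['title'].tolist())  # ['I am legend', 'wonder women', 'dead sea']
```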
Upvotes: 1
Reputation: 120429
You can encode your title column with unicode_escape and then decode it as latin1. If this round trip does not match your original data, remove the row, because the title contains non-Latin characters:
df = df[df['title'] == df['title'].str.encode('unicode_escape').str.decode('latin1')]
print(df)
# Output
title
0 I am legend
1 wonder women
3 dead sea
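A runnable sketch of the round trip, with sample data assumed from the question:

```python
import pandas as pd

# Hypothetical sample mirroring the question's data
df = pd.DataFrame({'title': ['I am legend', 'wonder women', 'アライヴ',
                             '怪獣総進撃', 'dead sea']})

# unicode_escape turns any non-ASCII character into a backslash escape
# (e.g. 'ア' -> '\\u30a2'), so the round trip only equals the original
# string when the title had no such characters.
roundtrip = df['title'].str.encode('unicode_escape').str.decode('latin1')
df = df[df['title'] == roundtrip]
print(df['title'].tolist())  # ['I am legend', 'wonder women', 'dead sea']
```

One caveat worth noting: unicode_escape escapes every non-ASCII character, so accented Latin letters such as 'é' would also be dropped by this check; in practice it keeps ASCII-only titles.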
Upvotes: 1
Reputation: 11728
You can use the isascii() method (available in Python 3.7+). Example:
"I am legend".isascii() # True
"アライヴ".isascii() # False
Even if the string contains just one non-ASCII character, isascii() will return False.
(Note that for strings like '34?#5' the method returns True, because those are all ASCII characters.)
Upvotes: 0