Filter Non-English Keywords from Python List

Question

I have the following Python list:

List= ['Images', 'Maps', 'Play', 'YouTube', 'News', 'Gmail', 'Drive', None, 
'Web History', 'Settings', 'Sign in', 'Advanced search', 'Language tools', 
'हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം', 
'ਪੰਜਾਬੀ', 'Advertising\xa0Programs', 'Business Solutions', '+Google', 
'About Google', 'Google.co.in', 'Privacy', 'Terms']

I want to keep only the non-English keywords from this list, so that my final list looks like this:

List=['हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം','ਪੰਜਾਬੀ']

Can this be done with regex? I use Python 3.x. Thanks for your help!

blhsing · Accepted Answer

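Since the non-English characters here are all above the 7-bit ASCII range, you can keep each word in which at least one character has an ordinal number above 127 and is also considered a letter by str.isalpha(). The isalpha() check is needed because the non-breaking space '\xa0' in 'Advertising\xa0Programs' is above 127 but is not a letter, and the leading "w and" skips the None entry: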

[w for w in List if w and any(ord(c) > 127 and c.isalpha() for c in w)]

With your sample input, this returns:

['हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം', 'ਪੰਜਾਬੀ']
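
Since the question asks about regex, here is a rough sketch of the same test written with the re module. The names NON_ASCII_LETTER and non_english are made up for this example, and the pattern is only an approximation of the isalpha() check above: [^\W\d_] means "a word character that is not a decimal digit or underscore", which may also let through a few non-decimal numeric characters.

```python
import re

# One character that is outside the ASCII range (negative lookahead) and is a
# word character other than a decimal digit or underscore.
# NON_ASCII_LETTER and non_english are illustrative names, not from the answer.
NON_ASCII_LETTER = re.compile(r"(?![\x00-\x7F])[^\W\d_]")


def non_english(words):
    """Return the strings in words that contain at least one non-ASCII letter."""
    # "w and" skips None and empty strings before the regex is applied.
    return [w for w in words if w and NON_ASCII_LETTER.search(w)]


words = ['Images', 'Maps', 'Play', 'YouTube', 'News', 'Gmail', 'Drive', None,
         'Web History', 'Settings', 'Sign in', 'Advanced search', 'Language tools',
         'हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം',
         'ਪੰਜਾਬੀ', 'Advertising\xa0Programs', 'Business Solutions', '+Google',
         'About Google', 'Google.co.in', 'Privacy', 'Terms']

print(non_english(words))
```

As with the list comprehension, this treats any word containing a non-ASCII letter as non-English, so an English word with an accented letter such as 'café' would be kept as well.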
