Adam Iqshan

Reputation: 145

Filter Non English Keywords from Python List

I have the following Python list:

List= ['Images', 'Maps', 'Play', 'YouTube', 'News', 'Gmail', 'Drive', None, 
'Web History', 'Settings', 'Sign in', 'Advanced search', 'Language tools', 
'हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം', 
'ਪੰਜਾਬੀ', 'Advertising\xa0Programs', 'Business Solutions', '+Google', 
'About Google', 'Google.co.in', 'Privacy', 'Terms']

I want to keep only the non-English keywords from this list, so that my final list looks like this:

List=['हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം','ਪੰਜਾਬੀ']

Can this be done with regex? I use Python 3.x. Thanks for the help!

Upvotes: 0

Views: 1248

Answers (2)

blhsing

Reputation: 106768

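Since the non-English letters in your list are all above the 7-bit ASCII range, you can keep each word in which at least one character has an ordinal number above 127 and is considered alphabetic by str.isalpha(). The isalpha() check is needed because characters such as the non-breaking space \xa0 in 'Advertising\xa0Programs' are also above 127: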

[w for w in List if w and any(ord(c) > 127 and c.isalpha() for c in w)]

With your sample input, this returns:

['हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം', 'ਪੰਜਾਬੀ']
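For anyone who wants to run this end to end, here is a minimal sketch of the same test wrapped in a function. The function name `non_ascii_alpha_words` and the shortened sample list are only illustrative, not part of the original answer.

```python
def non_ascii_alpha_words(words):
    """Keep strings that contain at least one alphabetic character above ASCII."""
    return [
        w for w in words
        # `w and ...` skips None and empty strings before iterating over characters
        if w and any(ord(c) > 127 and c.isalpha() for c in w)
    ]

# Shortened, illustrative sample based on the list in the question
sample = ['Maps', None, 'Advertising\xa0Programs', '+Google', 'हिन्दी', 'বাংলা']
result = non_ascii_alpha_words(sample)
print(result)  # ['हिन्दी', 'বাংলা']
```

Note that this test also keeps words with accented Latin letters, for example 'Café', because 'é' is alphabetic and above 127. Whether that counts as "non-English" depends on your data.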

Upvotes: 2

Henry Yik

Reputation: 22503

This can also be done with a regex, by matching characters outside the range U+0000 to U+05C0:

import re

# Matches any character outside the range U+0000 to U+05C0
pattern = re.compile("[^\u0000-\u05C0]")

result = ["".join(pattern.findall(i)) for i in List if i is not None and pattern.search(i)]

print(result)

Result:

['हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം', 'ਪੰਜਾਬੀ']
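A possible variation, sketched here under the assumption that you want each matching string kept whole rather than rebuilt from only the matched characters: use the same character class with search() as a filter. The names `OUTSIDE_RANGE` and `words_outside_range` are hypothetical, chosen only for this example.

```python
import re

# Same range as in the answer above: any character outside U+0000..U+05C0
OUTSIDE_RANGE = re.compile(r"[^\u0000-\u05C0]")

def words_outside_range(words):
    """Keep whole strings that contain at least one character above U+05C0."""
    return [w for w in words if w is not None and OUTSIDE_RANGE.search(w)]

# Shortened, illustrative sample; 'Café' is added to show the range's effect
sample = ['Maps', None, 'Advertising\xa0Programs', 'Café', 'हिन्दी', 'தமிழ்']
filtered = words_outside_range(sample)
print(filtered)  # ['हिन्दी', 'தமிழ்']
```

Unlike the str.isalpha() approach, this range treats accented Latin letters such as 'é' (U+00E9) as inside the excluded block, so 'Café' is dropped. Scripts that sit below U+05C0 are excluded as well.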

Upvotes: 1
