Reputation: 145
I have the following Python list:
List= ['Images', 'Maps', 'Play', 'YouTube', 'News', 'Gmail', 'Drive', None,
'Web History', 'Settings', 'Sign in', 'Advanced search', 'Language tools',
'हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം',
'ਪੰਜਾਬੀ', 'Advertising\xa0Programs', 'Business Solutions', '+Google',
'About Google', 'Google.co.in', 'Privacy', 'Terms']
I want to keep only the non-English keywords from this list, so that my final list looks like this:
List=['हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം','ਪੰਜਾਬੀ']
Can this be done with a regex? I'm using Python 3.x. Thanks for the help!
Upvotes: 0
Views: 1248
Reputation: 106768
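Since the characters in these non-English words all lie above the 7-bit ASCII range, you can test whether any character in each word has an ordinal above 127 and is considered alphabetic by str.isalpha(). The isalpha() check is what excludes 'Advertising\xa0Programs', whose only non-ASCII character is a non-breaking space: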
[w for w in List if w and any(ord(c) > 127 and c.isalpha() for c in w)]
With your sample input, this returns:
['हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം', 'ਪੰਜਾਬੀ']
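To make the two conditions easier to see, here is a minimal sketch of the same test wrapped in a helper. The function name has_non_ascii_letter and the short samples list are illustrative assumptions, not part of the original answer. Note that accented Latin words such as 'café' also pass this test, which may or may not be what you want.

```python
def has_non_ascii_letter(word):
    """Return True if word contains at least one alphabetic character above ASCII."""
    # bool(word) guards against None and empty strings.
    return bool(word) and any(ord(c) > 127 and c.isalpha() for c in word)


# Hypothetical sample data: a subset of the question's list plus 'café'.
samples = ['Gmail', None, 'हिन्दी', 'Advertising\xa0Programs', 'café']

filtered = [w for w in samples if has_non_ascii_letter(w)]
print(filtered)  # ['हिन्दी', 'café']
```

The non-breaking space in 'Advertising\xa0Programs' is above 127 but is not alphabetic, so that entry is dropped.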
Upvotes: 2
Reputation: 22503
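This can also be done with a regex. The character class below matches any character outside the range U+0000 to U+05C0, and only the matching characters of each string are kept: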
import re
result = ["".join(re.findall("[^\u0000-\u05C0]",i)) for i in List if i is not None and re.findall("[^\u0000-\u05C0]",i)]
print(result)
Result:
['हिन्दी', 'বাংলা', 'తెలుగు', 'मराठी', 'தமிழ்', 'ગુજરાતી', 'ಕನ್ನಡ', 'മലയാളം', 'ਪੰਜਾਬੀ']
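As a variation, the pattern can be compiled once and used with search(), which avoids running findall() twice per string. This is only a sketch under a few assumptions: the names NON_LATIN and filter_non_latin are made up for illustration, and unlike the one-liner above it keeps each matching string whole rather than joining only the matched characters (so spaces or ASCII characters in a mixed string would survive). The U+05C0 cutoff is taken from the answer as-is; scripts that sit below it, such as Greek and Cyrillic, would be dropped.

```python
import re

# Any single character above U+05C0, the same cutoff used in the answer above.
NON_LATIN = re.compile(r"[^\u0000-\u05C0]")


def filter_non_latin(words):
    """Keep strings that contain at least one character above U+05C0."""
    return [w for w in words if w is not None and NON_LATIN.search(w)]


# Hypothetical sample data drawn from the question's list.
samples = ['Images', None, 'தமிழ்', 'Advertising\xa0Programs', 'বাংলা']

result = filter_non_latin(samples)
print(result)  # ['தமிழ்', 'বাংলা']
```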
Upvotes: 1