Reputation: 389
Suppose we have a Unicode string in Python:
s = u"abc你好def啊"
Now I want to split it into runs of ASCII and non-ASCII characters, with a result like
result = ["abc", "你好", "def", "啊"]
How can I implement that?
Upvotes: 3
Views: 901
Reputation: 1044
import re
s = "abc你好def啊"
filter(None, re.split(r'(\w+|\W+)', s))
This works in Python 2.x, where s is a byte string and \w matches only ASCII word characters.
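In Python 3, \w in a str pattern also matches Unicode word characters, so the call above would return the whole string as one piece. Below is a minimal sketch of one possible adaptation, assuming Python 3 and the re.ASCII flag; the helper name split_word_runs is made up for illustration. Note that it separates ASCII word characters from everything else, so ASCII spaces and punctuation end up grouped with the non-ASCII text.

```python
import re

def split_word_runs(s):
    # re.ASCII restricts \w to [a-zA-Z0-9_], so \W covers everything else,
    # including non-ASCII characters.
    parts = re.split(r'(\w+|\W+)', s, flags=re.ASCII)
    # re.split leaves empty strings between adjacent matches; drop them.
    return [p for p in parts if p]

print(split_word_runs("abc你好def啊"))
```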
Upvotes: 1
Reputation: 2807
You can do something like this:
s = u"abc你好def啊"
status = ord(s[0]) < 128
word = ""
res = []
for b, letter in zip([ord(c) < 128 for c in s], s):
    if b != status:
        res.append(word)
        status = b
        word = ""
    word += letter
res.append(word)
print res
>> ["abc", "你好", "def", "啊"]
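The same idea (grouping consecutive characters by whether their code point is below 128) could also be expressed with itertools.groupby. This is only a sketch, written for Python 3; the function name split_by_ascii is hypothetical and not part of the answer above.

```python
from itertools import groupby

def split_by_ascii(s):
    # groupby yields one group per run of consecutive characters
    # that share the same key, here "is this character ASCII?".
    return [''.join(group) for _, group in groupby(s, key=lambda c: ord(c) < 128)]

print(split_by_ascii("abc你好def啊"))
```

Unlike the loop above, this version does not index s[0], so an empty string simply yields an empty list.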
Upvotes: 2
Reputation: 12558
With a regex, you can simply match alternating runs of characters that are, or are not, ASCII letters and digits.
>>> import re
>>> re.findall('([a-zA-Z0-9]+|[^a-zA-Z0-9]+)', u"abc你好def啊")
["abc", "你好", "def", "啊"]
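Or, with all printable ASCII characters (re.escape keeps characters such as \ and ] from being misread inside the character class):
>>> ascii = re.escape(''.join(chr(x) for x in range(33, 127)))
>>> re.findall('([{}]+|[^{}]+)'.format(ascii, ascii), u"abc你好def啊")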
['abc', '你好', 'def', '啊']
Or, even simpler as suggested by @Dolda2000
>>> re.findall('([ -~]+|[^ -~]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']
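The range [ -~] covers only printable ASCII (0x20 to 0x7E), so tabs and newlines would be grouped with the non-ASCII text. If the full 7-bit range should count as ASCII, one possible variant is sketched below, assuming Python 3; the names ASCII_RUNS and split_ascii_runs are invented for this example.

```python
import re

# One run of characters in 0x00-0x7F, or one run of characters outside it.
ASCII_RUNS = re.compile(r'[\x00-\x7f]+|[^\x00-\x7f]+')

def split_ascii_runs(s):
    # findall returns the matched runs in order; with no capture group,
    # each item is the full match.
    return ASCII_RUNS.findall(s)

print(split_ascii_runs("abc你好def啊"))
```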
Upvotes: 5