Reputation: 389

Python: split a unicode string by 3-byte UTF-8 characters

Suppose we have a unicode string in Python:

s = u"abc你好def啊"

Now I want to split it into runs of ASCII and non-ASCII characters, giving a result like result = ["abc", "你好", "def", "啊"]

How can I implement that?

Upvotes: 3

Answers (4)

tripleee

Reputation: 189517

Just ... split.

[s[i:i+3] for i in xrange(0, len(s), 3)]

http://ideone.com/PeoGaF

Upvotes: 0

minocha

Reputation: 1044

import re

s = "abc你好def啊"
filter(None, re.split(r'(\w+|\W+)', s))

This works in Python 2.x, where s is a byte string and \w matches only ASCII word characters.
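
On Python 3 the same pattern behaves differently, because `\w` matches Unicode word characters by default and the whole string comes back as a single token. A minimal sketch of one way around that, assuming Python 3 and the `re.ASCII` flag; `split_ascii_words` is a hypothetical helper name, not part of the answer above:

```python
import re

def split_ascii_words(s):
    # re.ASCII restricts \w to [a-zA-Z0-9_], so CJK characters fall under \W.
    # The capturing group makes re.split return the matched runs; the empty
    # strings that appear between adjacent matches are filtered out.
    return [tok for tok in re.split(r'(\w+|\W+)', s, flags=re.ASCII) if tok]

print(split_ascii_words("abc你好def啊"))  # ['abc', '你好', 'def', '啊']
```

Note that ASCII punctuation and spaces also count as `\W` here, so they would be grouped together with any neighbouring non-ASCII text.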

Upvotes: 1

YOBA

Reputation: 2807

You can do something like this:

s = u"abc你好def啊"
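status = ord(s[0]) < 128  # True while the current run is ASCII
word = u""
res = []

for letter in s:
    is_ascii = ord(letter) < 128
    if is_ascii != status:
        # The run type changed: store the finished run and start a new one
        res.append(word)
        status = is_ascii
        word = u""
    word += letter
res.append(word)  # store the final run

print(res)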
>> ["abc", "你好", "def", "啊"]
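
The same run-grouping idea can also be sketched with `itertools.groupby`, which collects consecutive items that share a key. This is only an illustration, assuming Python 3; `split_by_ascii` is a hypothetical name:

```python
from itertools import groupby

def split_by_ascii(s):
    # groupby yields one group per run of consecutive characters with the
    # same key; the key is whether the code point is below 128 (ASCII).
    return [''.join(group) for _, group in groupby(s, key=lambda c: ord(c) < 128)]

print(split_by_ascii("abc你好def啊"))  # ['abc', '你好', 'def', '啊']
```

Unlike the loop above, this version also handles an empty string, for which it returns an empty list.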

Upvotes: 2

C14L

Reputation: 12558

With a regex you can simply alternate between runs of ASCII alphanumerics and runs of everything else.

>>> import re
>>> re.findall('([a-zA-Z0-9]+|[^a-zA-Z0-9]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']

Or, with all printable ASCII characters (excluding the space):

>>> ascii = re.escape(''.join(chr(x) for x in range(33, 127)))
>>> re.findall('([{}]+|[^{}]+)'.format(ascii, ascii), u"abc你好def啊")
['abc', '你好', 'def', '啊']

Or, even simpler, as suggested by @Dolda2000:

>>> re.findall('([ -~]+|[^ -~]+)', u"abc你好def啊")
['abc', '你好', 'def', '啊']
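
The class `[ -~]` covers only printable ASCII (space through tilde), so a tab or newline would be counted as non-ASCII. If the whole ASCII range should be treated as one kind of run, a sketch along these lines may help, assuming Python 3; `ASCII_RUNS` and `split_ascii_runs` are hypothetical names:

```python
import re

# Alternate between runs of ASCII (U+0000..U+007F) and runs of everything else.
ASCII_RUNS = re.compile(r'[\x00-\x7f]+|[^\x00-\x7f]+')

def split_ascii_runs(s):
    # The pattern has no capturing group, so findall returns whole matches.
    return ASCII_RUNS.findall(s)

print(split_ascii_runs("abc你好def啊"))  # ['abc', '你好', 'def', '啊']
print(split_ascii_runs("a\tb你好"))      # ['a\tb', '你好']
```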

Upvotes: 5
