David

Reputation: 1569

How can I slice a substring from a unicode string with Python?

I have a unicode string as a result: u'splunk>\xae\uf001'

How can I get the substring 'uf001' as a simple string in Python?

Upvotes: 2

Views: 4760

Answers (3)

jfs

Reputation: 414865

u'' is how a Unicode string is represented in Python source code. The REPL uses this representation by default to display unicode objects:

>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])


If your terminal is not configured to display Unicode, or if you are on a narrow build (likely for Python 2 on Windows), the result may be different.

A Unicode string is an immutable sequence of Unicode code points in Python. len(u'\uf001') == 1: the string does not contain the five characters uf001. You could also write the character literally between the quotes of u'' (on Python 2 you must declare the character encoding of your source file if you use non-ASCII characters):

>>> u'\uf001' == u''
True

It is just a different way to represent exactly the same Unicode character (a single codepoint in this case).
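
A minimal sketch of the point above, assuming Python 3 (where the u prefix is optional but still accepted). The variable names are illustrative only:

```python
s = u'splunk>\xae\uf001'

# The string holds 9 code points; '\uf001' is just one of them.
length = len(s)
last = s[-1]

# ord() gives the integer code point of that single character.
code_point = ord(last)

print(length)           # 9
print(len(last))        # 1
print(hex(code_point))  # 0xf001
print('uf001' in s)     # False: those five characters are not in the string
```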

Note: some user-perceived characters may span several Unicode code points, e.g.:

>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё
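
As a rough illustration of why this matters for slicing (assuming Python 3; the names below are made up for the example), the decomposed form has two code points, so [-1] returns only the combining mark rather than the whole letter:

```python
import unicodedata

composed = u'\u0451'  # CYRILLIC SMALL LETTER IO as a single code point
decomposed = unicodedata.normalize('NFKD', composed)

print(len(composed))    # 1
print(len(decomposed))  # 2

# Indexing the decomposed form yields the combining diaeresis alone.
print(hex(ord(decomposed[-1])))  # 0x308

# NFC puts the pieces back together into one code point.
recomposed = unicodedata.normalize('NFC', decomposed)
print(recomposed == composed)    # True
```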

Upvotes: 1

Anand S Kumar

Reputation: 91009

Since you want the actual string (as seen from the comments), just take the last character with the [-1] index. Example:

>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>®ï€
>>> a[-1]
'\uf001'
>>> print(a[-1])
ï€

If you want the escaped representation (\uf001), take repr(a[-1]). Example:

>>> repr(a[-1])
"'\\uf001'"

\uf001 is a single Unicode character (not several characters), so you can get it directly by indexing, as above.

You see \uf001 because you are looking at the result of repr() on the string. If you print it or use it somewhere else (for example, write it to a file), it will be the actual U+F001 character.
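
If the escape text itself is what you need, one possible alternative to slicing repr() is the standard unicode_escape codec. This is a sketch assuming Python 3; note that the codec picks the escape form by code point, so a character such as \xae comes out as \xae rather than \u00ae:

```python
a = u'splunk>\xae\uf001'
last = a[-1]

# unicode_escape produces the backslash escape as ASCII bytes.
escaped = last.encode('unicode_escape').decode('ascii')

print(escaped)       # \uf001
print(len(escaped))  # 6: a backslash followed by uf001
print(escaped[1:])   # uf001
```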

Upvotes: 2

Amadan

Reputation: 198526

The characters uf001 are not actually present in the string, so you can't just slice them off. You can do

repr(s)[-6:-1]

or

'u' + hex(ord(s[-1]))[2:]
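
One caveat with the hex() variant: it does not zero-pad, so a character like \xae would give 'uae' rather than 'u00ae'. A padded version might look like the following sketch (Python 3 assumed; escape_name is a hypothetical helper name, not something from the question):

```python
def escape_name(ch):
    """Return 'uXXXX' for a BMP character, 'UXXXXXXXX' otherwise."""
    cp = ord(ch)
    if cp > 0xFFFF:
        # Code points outside the BMP use the 8-digit \U form.
        return 'U{:08x}'.format(cp)
    return 'u{:04x}'.format(cp)

s = u'splunk>\xae\uf001'
print(escape_name(s[-1]))  # uf001
print(escape_name(s[-2]))  # u00ae
```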

Upvotes: 2
