user2498957

Reputation: 27

Python: Read in escaped Unicode characters and turn them into readable text

I have an RDF file where most objects consist of escaped Unicode characters as follows:

...
<http://dbpedia.org/resource/Ry%C5%8Dgoku_Kokugikan> <http://www.w3.org/2000/01/rdf-schema#label> "\u4E21\u56FD\u56FD\u6280\u9928"@ja .
<http://dbpedia.org/resource/Tunisia> <http://www.w3.org/2000/01/rdf-schema#label> "\u30C1\u30E5\u30CB\u30B8\u30A2"@ja .
...

I want to read in this file using a Python script and convert those objects into readable text, i.e. for the above example I would want the following output:

両国国技館
チュニジア

So far, my code looks as follows:

import codecs

for line in codecs.open("labels-en-uris_ja.nt","r","utf-8"):
    tmp = line.split(" ")
    label = tmp[2]
    label = label.split("@")[0]
    label = label.replace("\"","")
    print u"{0}".format(label)

However, this returns the escaped Unicode characters unchanged, i.e. as

\u4E21\u56FD\u56FD\u6280\u9928
\u30C1\u30E5\u30CB\u30B8\u30A2

Replacing the last line with a plain print label gives exactly the same result. However, print u"\u4E21\u56FD\u56FD\u6280\u9928" gives the desired output, so I assume something is wrong with the way I read in the file. What is the correct way to produce the output I want?

Upvotes: 1

Views: 77

Answers (1)

Aaron Christiansen

Reputation: 11807

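The file stores each character as the literal text \uXXXX (a backslash, a "u" and four hex digits), so reading it as UTF-8 leaves those sequences untouched. In Python 2 you can call the .decode("unicode_escape") method on your string to turn them into the characters they stand for. This works as long as the label itself contains only ASCII characters, which is the case for the escaped labels shown in the question.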

print u"{0}".format(label.decode("unicode_escape"))
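For Python 3, where str has no .decode method, the same idea can be sketched as below. This is only an illustration under a few assumptions: the helper name label_from_line is made up for this example, each line has the subject, predicate and object separated by single spaces as in the question, and the label contains only ASCII characters (otherwise .encode("ascii") raises UnicodeEncodeError).

```python
def label_from_line(line):
    """Return the decoded label from one N-Triples line (hypothetical helper)."""
    # Split off subject and predicate only, so spaces inside the label survive.
    obj = line.split(" ", 2)[2]
    # Drop the trailing " ." that terminates the triple.
    obj = obj.rstrip().rstrip(".").rstrip()
    # Remove the language tag (e.g. @ja) and the surrounding quotes.
    literal = obj.rsplit("@", 1)[0].strip('"')
    # Assumption: the label is pure ASCII, so this round trip is safe.
    # unicode_escape turns the literal \uXXXX text into real characters.
    return literal.encode("ascii").decode("unicode_escape")


# The two sample lines from the question, as raw strings so that the
# backslashes stay literal, exactly as they appear in the file.
sample_lines = [
    r'<http://dbpedia.org/resource/Ry%C5%8Dgoku_Kokugikan> '
    r'<http://www.w3.org/2000/01/rdf-schema#label> '
    r'"\u4E21\u56FD\u56FD\u6280\u9928"@ja .',
    r'<http://dbpedia.org/resource/Tunisia> '
    r'<http://www.w3.org/2000/01/rdf-schema#label> '
    r'"\u30C1\u30E5\u30CB\u30B8\u30A2"@ja .',
]

labels = [label_from_line(line) for line in sample_lines]
# labels == ["両国国技館", "チュニジア"]
```

When reading the real file, the same helper could be applied to each line of open("labels-en-uris_ja.nt", encoding="utf-8"). For anything beyond simple labels, a dedicated RDF parser is likely the more robust choice, since N-Triples literals can also carry datatypes and other escape forms.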

Upvotes: 1
