Reputation: 27
I have an RDF file where most objects consist of escaped Unicode characters as follows:
...
<http://dbpedia.org/resource/Ry%C5%8Dgoku_Kokugikan> <http://www.w3.org/2000/01/rdf-schema#label> "\u4E21\u56FD\u56FD\u6280\u9928"@ja .
<http://dbpedia.org/resource/Tunisia> <http://www.w3.org/2000/01/rdf-schema#label> "\u30C1\u30E5\u30CB\u30B8\u30A2"@ja .
...
I want to read in this file using a Python script and convert those objects into readable text, i.e. for the above example I would want the following output:
両国国技館
チュニジア
So far, my code looks as follows:
import codecs

for line in codecs.open("labels-en-uris_ja.nt", "r", "utf-8"):
    tmp = line.split(" ")
    label = tmp[2]
    label = label.split("@")[0]
    label = label.replace("\"", "")
    print u"{0}".format(label)
However, this returns the escaped Unicode characters unchanged, i.e. as
\u4E21\u56FD\u56FD\u6280\u9928
\u30C1\u30E5\u30CB\u30B8\u30A2
Using simply print label in the last line of my code gives exactly the same result. However, print u"\u4E21\u56FD\u56FD\u6280\u9928" gives the desired output, so I assume there is something wrong with the way I read in that file. What would be the correct way to produce the output I want?
Upvotes: 1
Views: 77
Reputation: 11807
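The file contains the literal characters backslash, "u" and four hex digits, so reading it as UTF-8 leaves them untouched; only a u"..." literal in Python source gets its escapes interpreted. In Python 2 you can call the .decode("unicode_escape") method on your string object to convert those sequences into the characters they stand for: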
print u"{0}".format(label.decode("unicode_escape"))
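For Python 3, where str has no .decode method, the same idea could be sketched roughly as follows. The helper name extract_label is hypothetical, and the sketch assumes each line holds one quoted label made up only of ASCII characters and \uXXXX escapes, as in the sample above. It takes the text between the first and last double quote instead of splitting on spaces, since a label may itself contain spaces.

```python
def extract_label(line):
    """Return the decoded label from one N-Triples-style line (hypothetical helper)."""
    # Take everything between the first and the last double quote,
    # so labels containing spaces are not cut short.
    start = line.index('"') + 1
    end = line.rindex('"')
    raw = line[start:end]
    # Assumption: raw is pure ASCII (escapes only). Encoding to ASCII bytes
    # and decoding with "unicode_escape" interprets the \uXXXX sequences.
    return raw.encode("ascii").decode("unicode_escape")


sample = (
    r'<http://dbpedia.org/resource/Tunisia> '
    r'<http://www.w3.org/2000/01/rdf-schema#label> '
    r'"\u30C1\u30E5\u30CB\u30B8\u30A2"@ja .'
)
print(extract_label(sample))  # prints チュニジア
```

If a label already contains raw non-ASCII characters, the .encode("ascii") step raises UnicodeEncodeError, so this is only a starting point for files escaped like the one in the question.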
Upvotes: 1