Reputation: 43
I'm reading a text file that has unicode characters from many different countries. The data in the file is also in JSON format.
I'm working on a CentOS machine. When I open the file in a terminal, the Unicode characters display just fine (so my terminal is configured for Unicode).
When I test my code in Eclipse, it works fine. When I run my code in the terminal, it throws an error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128)
import json
import sys

delim = "\t"  # field delimiter -- assumed here; not shown in the original post

for line in open("data-01083"):
    try:
        tmp = line
        if tmp == "":
            break
        theData = json.loads(tmp[41:])
        for loc in theData["locList"]:
            outLine = tmp[:40]
            outLine = outLine + delim + theData["names"][0]["name"]
            outLine = outLine + delim + str(theData.get("Flagvalue"))
            outLine = outLine + delim + str(loc.get("myType"))
            flatAdd = ""
            srcAddr = loc.get("Address")
            if srcAddr is not None:
                flatAdd = delim + str(srcAddr.get("houseNumber"))
                flatAdd = flatAdd + delim + str(srcAddr.get("streetName"))
                flatAdd = flatAdd + delim + str(srcAddr.get("postalCode"))
                flatAdd = flatAdd + delim + str(srcAddr.get("CountryCode"))
            else:
                flatAdd = delim + "None" + delim + "None" + delim + "None" + delim + "None" + delim + "None"
            outLine = outLine + flatAdd
            sys.stdout.write(("%s\n" % (outLine)).encode('utf-8'))
    except:
        sys.stdout.write("Error Processing record\n")
So everything works until it gets to streetName, which is where the non-ASCII characters start showing up, and there it crashes with the UnicodeDecodeError.
I can fix that instance by adding .encode('utf-8'):
flatAdd = flatAdd + delim + str(srcAddr.get("streetName").encode('utf-8'))
but then it crashes with the UnicodeDecodeError on the next line:
outLine = outLine + flatAdd
I have been stumbling through these types of issues for a month. Any feedback would be greatly appreciated!!
Upvotes: 0
Views: 4702
Reputation: 43
The presentation from Robᵩ (http://nedbatchelder.com/text/unipain.html) REALLY helped with my understanding of Unicode. I HIGHLY recommend it to anyone with Unicode issues.
My takeaway:
In my case I was reading from stdin and from a file, and writing to stdout.
For stdin:
    inData = codecs.getreader('utf-8')(sys.stdin)
For a file:
    inData = codecs.open("myFile", "r", "utf-8")
For stdout (do this once, before writing anything to stdout):
    sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
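Put together, the principle is: decode bytes to unicode at the input boundary, work purely in unicode inside the program, and encode back to UTF-8 once at the output boundary. A minimal byte-level sketch of that pattern (the byte values are illustrative; runs under both Python 2 and 3):

```python
# -*- coding: utf-8 -*-

# UTF-8 bytes, as they would arrive from a file or stdin
raw = b'Stra\xc3\x9fe'

# 1. decode at the boundary: bytes -> unicode
text = raw.decode('utf-8')

# 2. operate only on unicode strings internally
line = u'|'.join([u'addr', text])

# 3. encode once, on the way out
out = line.encode('utf-8')
```

Mixing the two types (a raw byte string concatenated with a unicode string) is exactly what triggers the implicit ASCII decode that fails on bytes like 0xc3.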
Upvotes: 1
Reputation:
This might fix your problem. I'm saying might because encoding sometimes makes weird stuff happen ;)
#!/usr/bin/python
# -*- coding: utf-8 -*-
text_file_utf8 = text_file.encode('utf8')
From this point on you should be rid of the messages. If not, please give feedback on what kind of file you have and its language, and maybe some file header data.
text_file.decode("ISO-8859-1")
might also be a solution.
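Whether ISO-8859-1 is the right choice depends on how the file was actually written; the same byte means different things under different codecs. A quick illustration (the byte value here is illustrative):

```python
raw = b'\xe9'  # the character 'é' encoded as ISO-8859-1 / Latin-1

# Under Latin-1 every byte is valid and maps to exactly one character:
latin1_text = raw.decode('iso-8859-1')   # u'\xe9', i.e. 'é'

# Under UTF-8, that same lone byte is an invalid sequence:
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

Note that Latin-1 never raises a decode error, so it can silently produce mojibake if the file is really UTF-8; picking the codec that matches the file matters more than picking one that does not crash.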
If all else fails, look into the codecs module:
here: http://docs.python.org/2/library/codecs.html
with codecs.open('your_file.extension', 'r', 'utf8') as indexKey:
    # Your code here
    pass
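A runnable round-trip sketch of that pattern, writing and then reading back a UTF-8 file (the filename is made up; codecs.open decodes on read, so lines come back as unicode):

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'your_file.txt')

# Write unicode text; codecs.open encodes it to UTF-8 on the way out
with codecs.open(path, 'w', 'utf8') as f:
    f.write(u'Z\u00fcrich\n')

# Read it back; each line arrives already decoded to unicode
with codecs.open(path, 'r', 'utf8') as indexKey:
    first_line = indexKey.readline().rstrip()
```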
Upvotes: 1