Reputation: 115
So I'm trying to parse JSON files out into a tab-delimited file. The parsing seems to work fine and all the data is coming through. But the oddest thing is happening in the output file: I told it to use a tab delimiter, and the output does use tabs, but it still seems to keep the single quotes, and for some reason it also seems to be adding the letter b to the beginning of each value. I manually typed in the header, and that works fine, but the data itself is acting weird. Here's an example of the output I'm getting:
id created text screen name name latitude longitude place name place type
b'1234567890' b'Thu Mar 14 19:39:07 +0000 2013' "b""I'm at Bank Of America (Wayne, MI) http://t.co/asdf""" b'userid' b'username' 42.28286837 -83.38487864 b'Bank Of America, Wayne' b'poi'
b'1234567891' b'Thu Mar 14 19:39:16 +0000 2013' b'here is a sample tweet \xf0\x9f\x8f\x80 #notingoodhands' b'userid2' b'username2'
Here is the code that I'm using to write the data out.
from csv import writer

out = open(filename, 'w')
out.write('id\tcreated\ttext\tscreen name\tname\tlatitude\tlongitude\tplace name\tplace type')
out.write('\n')
rows = zip(ids, times, texts, screen_names, names, lats, lons, place_names, place_types)
csv = writer(out, dialect='excel', delimiter='\t')
for row in rows:
    values = [(value.encode('utf-8') if hasattr(value, 'encode') else value) for value in row]
    csv.writerow(values)
out.close()
So here's the thing. If I did this without the utf-8 bit and just output the strings directly, the formatting would be exactly how I want it. But then when people type in special characters, the program crashes because it can't handle them. Here's the traceback:
Traceback (most recent call last):
File "tweets.py", line 34, in <module>
csv.writerow(values)
File "C:\Python33\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3c0' in position 153: character maps to <undefined>
Adding the utf-8 bit fixes the crash, but then it converts the output to the type you see above, with all these extra characters. Does anyone have any thoughts on this?
Upvotes: 9
Views: 10835
Reputation: 365707
You've got multiple things going on here, but first, let's clear up a bit of confusion.
Encoding non-ASCII characters to UTF-8 means you get multiple bytes. For example, the character 🏀 is \xf0\x9f\x8f\x80 in UTF-8. But that's still just one character; it's just a character that takes four bytes. If you write the string to a binary file, then look at that file in a UTF-8-compatible tool (Notepad or TextEdit, or just cat on a UTF-8-friendly terminal/shell), you'll see one 🏀, not four garbage characters.
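You can see the one-character/four-byte distinction right in the interpreter. A quick sketch, using the same character from the traceback in the question:
s = '\U0001f3c0'             # the basketball character
data = s.encode('utf-8')
print(len(s))                # 1 -- one character
print(len(data))             # 4 -- four bytes
print(data)                  # b'\xf0\x9f\x8f\x80'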
Second, b'abc' is not a string with b added to the beginning; it's the repr representation of the byte string abc. The b is no more a part of the string than the quotes are.
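A quick sketch showing that the b and the quotes are just display artifacts of repr, not stored data:
data = 'abc'.encode('utf-8')
print(data)                  # b'abc' -- printing a bytes object shows its repr
print(len(data))             # 3 -- just the three bytes
print(data.decode('utf-8'))  # abc -- the original string, no b anywhere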
Finally, in Python 3, you can't open a file in text mode and then write byte strings to it. Either open it in text mode, with an encoding, and write normal unicode strings, or open it in binary mode and write encoded byte strings.
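A minimal sketch of both options, using a hypothetical out.txt and a line from the question's sample data:
# Option 1: text mode with an explicit encoding; write str and let Python encode.
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write('here is a sample tweet \U0001f3c0 #notingoodhands\n')
# Option 2: binary mode; encode to bytes yourself and write those.
with open('out.txt', 'wb') as f:
    f.write('here is a sample tweet \U0001f3c0 #notingoodhands\n'.encode('utf-8'))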
Upvotes: 1
Reputation: 1121744
You are writing byte data instead of unicode to your files, because you are encoding the data yourself.
Remove the encode calls altogether and let Python handle this for you; open the file with the UTF-8 encoding and the rest takes care of itself:
out = open(filename, 'w', encoding='utf8')
This is documented in the csv module documentation:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
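Applied to the code in the question, a minimal corrected sketch (newline='' follows the csv docs' advice for file objects passed to the writer; everything else is the original code with the encode call dropped):
from csv import writer

out = open(filename, 'w', encoding='utf8', newline='')
out.write('id\tcreated\ttext\tscreen name\tname\tlatitude\tlongitude\tplace name\tplace type')
out.write('\n')
rows = zip(ids, times, texts, screen_names, names, lats, lons, place_names, place_types)
csv = writer(out, dialect='excel', delimiter='\t')
for row in rows:
    csv.writerow(row)        # plain str values; the file object encodes them to UTF-8
out.close()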
Upvotes: 13