Reputation: 199
I'm trying to read all lines of a TSV file into a list. However, the TSV reader terminates early and does not read the whole file. I know this because data is only 1/6 of the length of the whole file. No errors are thrown when this happens.
When I manually inspect the line it terminates on (corresponding to the length of data), that line has tons of Unicode symbols. I thought I could catch a UnicodeDecodeError, but instead of throwing an error, it stops reading the file entirely. I imagine it's hitting something that's triggering an end-of-file??
What's really throwing me for a loop: the error only occurs when I'm using Python 2.7 on Windows Server 2012. The file reads 100% perfectly on Unix implementations of Python 2.7 using both code snippets below. I'm running this inside Anaconda on both.
Here's what I've tried and neither works:
    import csv

    data = []
    with open('data.tsv', 'r') as infile:
        # Strip NUL bytes from each line before handing it to the csv reader
        csvreader = csv.reader((x.replace('\0', '') for x in infile),
                               delimiter='\t', quoting=csv.QUOTE_NONE)
        data = list(csvreader)
I also tried reading line by line...
    with open('data.tsv', 'r') as infile:
        for line in infile:
            try:
                d = line.split('\t')
                q = d[0].decode('utf-8')  # where the unicode symbols are located
                data.append(d)
            except UnicodeDecodeError:
                continue
Thanks in advance!
Upvotes: 0
Views: 55
Reputation: 25779
As per the general suggestion in the csv module documentation:
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
So open your file with:
    with open('data.tsv', 'rb') as infile:
        csvreader = csv.reader(infile, delimiter='\t', quoting=csv.QUOTE_NONE)
        data = list(csvreader)
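Some context on why the 'b' flag matters here: on Windows, Python 2 text mode treats a Ctrl-Z byte (0x1A) as end-of-file, so a stray 0x1A in the data would silently cut the read short, which matches the symptom you describe. Whether your file actually contains such a byte is an assumption on my part. The sketch below is one way to check; the helper name find_ctrl_z_offsets and its chunk_size parameter are made up for illustration.

```python
import io

CTRL_Z = b'\x1a'  # the byte that Windows text mode interprets as end-of-file


def find_ctrl_z_offsets(path, chunk_size=1 << 20):
    """Return the byte offsets of every 0x1A byte in the file at `path`.

    Hypothetical diagnostic helper: it reads the file in binary mode, so
    the scan itself is not affected by text-mode EOF handling.
    """
    offsets = []
    base = 0  # offset of the current chunk within the file
    with io.open(path, 'rb') as infile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break
            start = 0
            while True:
                idx = chunk.find(CTRL_Z, start)
                if idx == -1:
                    break
                offsets.append(base + idx)
                start = idx + 1
            base += len(chunk)
    return offsets
```

If find_ctrl_z_offsets('data.tsv') returns a non-empty list and the first offset falls on the line where reading stops, that would support the Ctrl-Z explanation. An empty list would point to some other cause.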
Also, you will have to decode your strings if they contain Unicode data, or just use unicodecsv as a drop-in replacement so you don't have to worry about it.
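If you go the manual route, a minimal sketch of the decoding step could look like the following. It assumes the file is UTF-8 (your own snippet decodes with utf-8) and that each cell is a byte string, which is what csv.reader yields in Python 2 when the file is opened with 'rb'. decode_rows is a hypothetical helper, and using errors='replace' is my choice here so that one bad byte does not drop a whole row.

```python
def decode_rows(rows, encoding='utf-8'):
    """Yield each row with its byte-string cells decoded to unicode text.

    Hypothetical helper. Undecodable bytes become U+FFFD instead of
    raising UnicodeDecodeError (assumption: lossy decoding is acceptable).
    """
    for row in rows:
        yield [cell.decode(encoding, 'replace') for cell in row]
```

Under Python 2 this would be used as data = list(decode_rows(csvreader)) inside the with block.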
Upvotes: 1