Reputation: 386
I have a CSV file: 22 GB in size, 46,000,000 lines. To save memory, the file is read and processed in chunks:
import pandas as pd

tp = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8', quotechar='"')
for chunk in tp:
    pass  # process the chunk here
but the file is malformed and raises an exception:
Error tokenizing data. C error: Expected 87 fields in line 15092657, saw 162
Is there a way to discard this chunk and continue the loop with the next chunk?
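A minimal sketch of the loop being asked about: iterate the reader by hand and catch the tokenizing error per chunk (pandas.errors.ParserError in current pandas; older versions raise CParserError). Whether the C parser can resume cleanly after such an error is not guaranteed, so treat this as best-effort:

import pandas as pd

reader = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8', quotechar='"')
while True:
    try:
        chunk = next(reader)
    except StopIteration:
        break  # end of file reached
    except pd.errors.ParserError:
        continue  # discard the chunk that failed to tokenize
    # process the chunk here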
Upvotes: 1
Views: 3082
Reputation: 386
To intercept the bad lines, I use the following code:
import sys
import StringIO

import pandas as pd

# somewhere to store the parser's warnings
err = StringIO.StringIO()
# save a reference to the real stderr so we can restore it later
oldstderr = sys.stderr
# point stderr at our StringIO instance
sys.stderr = err

tp = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8',
                 quotechar='"', error_bad_lines=False)
for chunk in tp:
    pass  # process the chunk here

# restore stderr
sys.stderr = oldstderr

# the parser writes one "Skipping line ..." warning per bad line,
# so counting newlines gives the number of skipped lines
print str(err.getvalue().count('\n')) + ' lines skipped.'
print err.getvalue()
err.close()
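(For later readers: error_bad_lines was deprecated in pandas 1.3 and removed in pandas 2.0. A rough equivalent of the stderr trick above, assuming pandas >= 1.4, is to pass a callable to on_bad_lines together with the Python engine; the handler name below is made up for this sketch:)

import pandas as pd

bad_lines = []

def handle_bad_line(fields):
    # pandas calls this with the offending row split into fields
    bad_lines.append(fields)
    return None  # returning None tells pandas to drop the line

tp = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8',
                 quotechar='"', engine='python', on_bad_lines=handle_bad_line)
for chunk in tp:
    pass  # process the chunk here

print(str(len(bad_lines)) + ' lines skipped.')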
Upvotes: 1
Reputation: 386
As EdChum says, the question was how to skip the chunk, and adding error_bad_lines=False does the trick. Is there a way to intercept the trace giving the bad lines and count the faulty lines?
Upvotes: 1
Reputation: 897
The question is similar to an earlier one found here: Python Pandas Error tokenizing data
As the answers there point out, be aware that error_bad_lines=False silently drops the offending lines; a better approach is to investigate those lines in your dataset.
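A quick way to do that investigation, assuming f_in is the file path and using the 1-based line number from the error message, is a sketch like:

import itertools

# pull out the single line the parser complained about (line 15092657)
with open(f_in) as f:
    bad_line = next(itertools.islice(f, 15092657 - 1, 15092657))
print(bad_line)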
Upvotes: 1