seb835

Reputation: 386

Malformed CSV file and Pandas read_csv by chunk

I have a CSV file: 22 GB, 46,000,000 lines. To save memory, the file is read and processed in chunks:

tp = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8', quotechar='"')
for chunk in tp:
    pass  # process the chunk here

But the file is malformed and raises an exception:

Error tokenizing data. C error: Expected 87 fields in line 15092657, saw 162

Is there a way to discard this chunk and continue the loop with the next chunk?

Upvotes: 1

Views: 3082

Answers (3)

seb835

Reputation: 386

To intercept the bad lines, I use the following code (Python 2: pandas writes the "Skipping line ..." messages for error_bad_lines=False to stderr, so we capture stderr in a StringIO):

import sys
import StringIO
import pandas as pd

# somewhere to store the output pandas writes to stderr
err = StringIO.StringIO()
# save a reference to the real stderr so we can restore it later
oldstderr = sys.stderr
# redirect stderr to our StringIO instance
sys.stderr = err

tp = pd.read_csv(f_in, sep=',', chunksize=1000, encoding='utf-8',
                 quotechar='"', error_bad_lines=False)
for chunk in tp:
    pass  # process the chunk here

# restore stderr
sys.stderr = oldstderr

# print (or use) the captured skip messages: one message per skipped line
print '%d lines skipped.' % len(err.getvalue().splitlines())
print err.getvalue()
err.close()
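In recent pandas versions, error_bad_lines is deprecated and the stderr trick is no longer needed: on_bad_lines accepts a callable (pandas >= 1.4, engine='python' only) that receives each malformed row and can skip it while you count it. A minimal sketch, using an in-memory CSV in place of the real 22 GB file:

```python
import io

import pandas as pd

# small stand-in for the real file: the third line has 4 fields instead of 3
data = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n")

bad_lines = []  # collect every malformed row here


def handle_bad(line):
    # pandas calls this with the offending row split into fields;
    # returning None drops the row and the read continues
    bad_lines.append(line)
    return None


# a callable on_bad_lines requires engine='python' (pandas >= 1.4)
for chunk in pd.read_csv(data, chunksize=2, engine='python',
                         on_bad_lines=handle_bad):
    pass  # process the chunk here

print('%d lines skipped.' % len(bad_lines))
```

This keeps the chunked loop intact while giving you both the count and the content of every faulty line.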

Upvotes: 1

seb835

Reputation: 386

As EdChum says, the question was how to skip the chunk, and adding 'error_bad_lines=False' does the trick. Is there a way to intercept the trace reporting the bad lines and count the faulty lines?

Upvotes: 1

kristofferandreasen

Reputation: 897

The question is similar to an earlier one found here: Python Pandas Error tokenizing data

As the answers there say, be aware that using error_bad_lines=False silently removes the offending lines; a better approach is to investigate those lines in your dataset first.

Upvotes: 1
