Reputation: 4200
I am trying to load this CSV file into a pandas data frame using
import pandas as pd
filename = '2016-2018_wave-IV.csv'
df = pd.read_csv(filename)
However, despite my PC not being particularly slow (8 GB RAM, 64-bit Python) and the file being somewhat, but not extraordinarily, large (< 33 MB), loading the file takes more than 10 minutes. It is my understanding that this shouldn't take nearly that long, and I would like to figure out what's behind it.
(As suggested in similar questions, I have tried using the chunksize and usecols parameters (EDIT: and also low_memory), yet without success; so I believe this is not a duplicate but has more to do with the file or the setup.)
Could someone give me a pointer? Many thanks. :)
Upvotes: 3
Views: 823
Reputation: 4200
To summarize and expand the answer by @Hubert Dudek:
The issue was with the file: not only did it include " characters at the start and end of every line, but also within the lines themselves. After I fixed the former, the latter caused the column attribution to be messed up.
Upvotes: 0
Reputation: 1722
I was testing the file you shared, and the problem is that this CSV file has leading and trailing double quotes on every line (so pandas thinks the whole line is one column). They have to be removed before processing, for example by using sed on Linux, by processing and re-saving the file in Python, or simply by replacing all the double quotes in a text editor.
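A minimal sketch of the "process and re-save in Python" route. The file names and sample contents here are made up for illustration; the point is that with the outer quotes in place, pandas parses each line as a single quoted field, and stripping them restores the columns:

```python
import pandas as pd

# Recreate the defect described above on a tiny demo file:
# every line is wrapped in double quotes.
with open('broken.csv', 'w', encoding='utf-8') as f:
    f.write('"a,b,c"\n"1,2,3"\n"4,5,6"\n')

# pandas treats each quoted line as one column here.
print(pd.read_csv('broken.csv').shape)  # (2, 1)

# Strip the leading/trailing quotes from every line and re-save.
with open('broken.csv', encoding='utf-8') as fin, \
     open('fixed.csv', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(line.rstrip('\n').strip('"') + '\n')

df = pd.read_csv('fixed.csv')
print(df.shape)  # (2, 3)
```

Note that `str.strip('"')` only removes quotes at the ends of each line; if there are stray quotes inside the lines as well (as the asker found), a blanket `line.replace('"', '')` may be needed instead.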
Upvotes: 1