Reputation: 497
Currently trying to get a sample of a very large dataset (~54 GB). However, since I know that reading in anything larger than about 1 GB becomes very inefficient, I only want to read in the first 100k rows. This is what I have so far:
import pandas as pd

df_chunk = pd.read_csv(r'pol.csv', chunksize=1000000, engine='python')

chunk_list = []  # append each chunk df here

# Each chunk is in df format
for chunk in df_chunk:
    # Once the data filtering is done, append the chunk to the list
    chunk_list.append(chunk)

# concat the list into a dataframe
df_concat = pd.concat(chunk_list)
However, running this gives me this error:
File "path", line 3121, in _get_lines new_rows.append(next(self.data)) _csv.Error: ',' expected after '"'
Changing the engine to C throws a parsing error, and setting low_memory=False doesn't work with the python engine. Also, setting error_bad_lines=True skips way too many rows from the dataset.
I just need a small chunk of the dataset to work with, but it's extremely hard to even get that.
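For reference, this is roughly the simple call I was hoping would work (nrows here is just illustrative, and it assumes the first 100k rows are free of the bad quoting, which may not be the case):

# Hoped-for one-liner: only parse the first 100k rows.
# Assumes the problematic quoting doesn't appear in those rows.
df_sample = pd.read_csv(r'pol.csv', nrows=100000, engine='python')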
Upvotes: 0
Views: 160
Reputation: 2576
There seem to be some formatting issues in your big CSV. I suggest you first make a smaller file with just a fraction of the data and inspect it manually for the formatting issues; they need to be fixed before the file can be parsed successfully. To extract some portion, do:
with open('pol.csv') as f:
    with open('pol_part.csv', 'w') as g:
        for i in range(1000):  # replace 1000 with 100000 when ready
            g.write(f.readline())
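Once you've found and fixed the formatting problems in that sample, you can load it back with pandas. A minimal sketch, assuming the pol_part.csv file written above and that the first line is a header:

import pandas as pd

# Read the extracted sample; nrows caps how many rows are actually parsed.
# engine='python' matches what the question already uses.
df_sample = pd.read_csv('pol_part.csv', engine='python', nrows=100000)
print(df_sample.shape)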
Upvotes: 2