Reputation: 1932
I was handling a large CSV file and came across this problem. I am reading the CSV file in chunks and want to extract sub-dataframes based on the values of a particular column.
To explain the problem, here is a minimal version:
The CSV (save it as test1.csv, for example):
1,10
1,11
1,12
2,13
2,14
2,15
2,16
3,17
3,18
3,19
3,20
4,21
4,22
4,23
4,24
Now, as you can see, if I read the CSV in chunks of 5 rows, the groups in the first column get split across chunk boundaries. What I want is to load into memory only the rows for a particular id value.
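If the whole file fit in memory, this would of course be trivial (a sketch of what I cannot afford to do with the real file):

import pandas as pd

df = pd.read_csv('test1.csv', names=['id', 'val'])  # loads the entire file at once
sub_df = df[df['id'] == 2]  # all rows for id 2

Since my real file is too large for that, I read it in chunks.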
I achieved it using the following:
import pandas as pd

list_of_ids = dict()  # will map each id to the start and end row index of its group

# First pass: read the csv in chunks of 5 rows
for df_chunk in pd.read_csv('test1.csv', chunksize=5, names=['id', 'val'], iterator=True):
    # in each chunk, get the unique id values and add them to the dict
    for i in df_chunk['id'].unique().tolist():
        if i not in list_of_ids:
            list_of_ids[i] = []  # new ids do not have a start and end row index yet
    for i in list_of_ids.keys():  # ---------MARKER 1-----------
        idx = df_chunk[df_chunk['id'] == i].index  # row indices for this particular id
        if len(idx) != 0:  # the id is present in this chunk
            if len(list_of_ids[i]) == 0:  # the id is new in the dictionary
                list_of_ids[i].append(idx.tolist()[0])  # start
                list_of_ids[i].append(idx.tolist()[-1])  # end
            else:  # the id was already seen in a previous chunk
                list_of_ids[i] = [list_of_ids[i][0], idx.tolist()[-1]]  # keep old start, extend end

print(list_of_ids)

# Second pass: read only a particular id group from the file.
# The chunksize argument can again be used to read that group in pieces.
for id_, se in list_of_ids.items():
    print('Data for id: {}'.format(id_))
    skip, rows = se[0], (se[-1] - se[0] + 1)
    for df_chunk in pd.read_csv('test1.csv', chunksize=2, nrows=rows, skiprows=skip, names=['id', 'val'], iterator=True):
        print(df_chunk)
Truncated output from my code:
{1: [0, 2], 2: [3, 6], 3: [7, 10], 4: [11, 14]}
Data for id: 1
id val
0 1 10
1 1 11
id val
2 1 12
Data for id: 2
id val
0 2 13
1 2 14
id val
2 2 15
3 2 16
Data for id: 3
id val
0 3 17
1 3 18
What I want to ask is: is there a better way of doing this? The loop at MARKER 1 scans every id seen so far in every chunk, so it is bound to get slower as the number of distinct ids grows. I did save on memory usage, but time remains a problem. Is there an existing method for this?
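For context, the obvious micro-optimisation at MARKER 1 would be to loop only over the ids that actually occur in the current chunk. A sketch of that idea, which I have not benchmarked, as a drop-in body for the chunk loop:

for i in df_chunk['id'].unique():  # only the ids present in this chunk
    idx = df_chunk[df_chunk['id'] == i].index
    if i not in list_of_ids:  # first time this id is seen
        list_of_ids[i] = [idx[0], idx[-1]]  # start and end
    else:  # the id continues from a previous chunk
        list_of_ids[i][-1] = idx[-1]  # keep old start, extend end

Even so, I suspect there is a cleaner, existing way to do this.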
(I am looking for complete code in the answer.)
Upvotes: 0
Views: 646
Reputation: 61910
I suggest you use itertools for this, as follows:
import pandas as pd
import csv
import io
from itertools import groupby, islice
from operator import itemgetter

def chunker(n, iterable):
    """
    From answer: https://stackoverflow.com/a/31185097/4001592
    >>> list(chunker(3, 'ABCDEFG'))
    [['A', 'B', 'C'], ['D', 'E', 'F'], ['G']]
    """
    iterable = iter(iterable)
    return iter(lambda: list(islice(iterable, n)), [])

chunk_size = 5
with open('test1.csv') as csv_file:
    reader = csv.reader(csv_file)
    # group consecutive rows that share the same value in the first column
    for _, group in groupby(reader, itemgetter(0)):
        # read each group in pieces of at most chunk_size rows
        for chunk in chunker(chunk_size, group):
            g = [','.join(e) for e in chunk]
            df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
            print(df)
            print('---')
Output (partial)
0 1
0 1 10
1 1 11
2 1 12
---
0 1
0 2 13
1 2 14
2 2 15
3 2 16
---
0 1
0 3 17
1 3 18
2 3 19
3 3 20
---
...
This approach first reads the file in groups keyed on the first column:
for _, group in groupby(reader, itemgetter(0)):
and each group is then read in chunks of 5 rows (this can be changed via chunk_size):
for chunk in chunker(chunk_size, group):
The last part:
g = [','.join(e) for e in chunk]
df = pd.read_csv(io.StringIO('\n'.join(g)), header=None)
print(df)
print('---')
joins each chunk's rows back into a single CSV string that can be passed to pandas.
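One caveat worth noting: itertools.groupby only groups consecutive rows with equal keys, so this relies on the ids being contiguous in the file, as they are in your sample; if the same id reappeared later, it would start a new group. A quick illustration:

from itertools import groupby
from operator import itemgetter

rows = [['1', '10'], ['2', '13'], ['1', '12']]  # id 1 appears again, non-contiguously
print([k for k, _ in groupby(rows, itemgetter(0))])  # prints ['1', '2', '1']: two separate groups for id 1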
Upvotes: 1