Reputation: 20101
I have a very large CSV data set (too large to fit in RAM) that I'm working on with pandas. Right now I'm doing something like
df = pd.read_csv('verylargefile.csv', chunksize=10000)
for df_chunk in df:
    df_chunk['new_column'] = df_chunk['old_column'].apply(my_func)
    # do other operations and filters...
    df_chunk.to_csv('processed.csv', mode='a')
so I can perform the operations I need on the data set and save the output to another file.
The problem starts when I need groupings and statistics over this data set: I have to calculate the mean, the standard deviation and histograms of the entire data set, and also graph the results, look at trends, use statsmodels, etc. Since the chunks are not homogeneous samples, I cannot compute these statistics chunk by chunk.
df.groupby('column_1').sum()
TypeError: Cannot groupby on a TextFileReader
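For the mean and standard deviation I could in principle keep running totals over the chunks, something along these lines (the column name is just a placeholder), but that quickly gets unwieldy and does not extend to histograms, groupbys or statsmodels:
import numpy as np
import pandas as pd

# accumulate count, sum and sum of squares chunk by chunk
n = s = ss = 0.0
for chunk in pd.read_csv('verylargefile.csv', chunksize=10000):
    col = chunk['old_column']
    n += col.count()
    s += col.sum()
    ss += (col ** 2).sum()

mean = s / n
std = np.sqrt(ss / n - mean ** 2)  # population standard deviation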
I don't have the usual option of selecting only a few columns, and I don't see how storing my data in HDF5 would help. Is there a way?
Upvotes: 4
Views: 1424
Reputation: 28153
This looks like a situation where dask.dataframe could help. I have included an example below. For more information, see the dask documentation and this very nice tutorial.
In [1]: import dask.dataframe as dd
In [2]: !head data/accounts.0.csv
id,names,amount
138,Ray,1502
1,Tim,5
388,Ingrid,45
202,Zelda,1324
336,Jerry,-1607
456,Laura,-2832
65,Laura,-326
54,Yvonne,341
92,Sarah,3136
In [3]: dask_df = dd.read_csv('data/accounts.0.csv', chunkbytes=4000000)
In [4]: dask_df.npartitions
Out[4]: 4
In [5]: len(dask_df)
Out[5]: 1000000
In [6]: result = dask_df.groupby('names').sum()
In [7]: result.compute()
Out[7]:
id amount
names
Alice 10282524 43233084
Bob 8617531 47512276
Charlie 8056803 47729638
Dan 10146581 32513817
Edith 15164281 37806024
Frank 11310157 63869156
George 14941235 80436603
Hannah 3006336 25721574
Ingrid 10123877 54152865
Jerry 10317245 8613040
Kevin 6809100 16755317
Laura 9941112 34723055
Michael 11200937 36431387
Norbert 5715799 14482698
Oliver 10423117 32415534
Patricia 15289085 22767501
Quinn 10686459 16083432
Ray 10156429 9455663
Sarah 7977036 34970428
Tim 12283567 47851141
Ursula 4893696 37942347
Victor 8864468 15542688
Wendy 9348077 68824579
Xavier 6600945 -3482124
Yvonne 5665415 12701550
Zelda 8491817 42573021
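The other statistics mentioned in the question fit the same lazy model. Here is a rough sketch using the same dask_df (I have not rerun these in the session above, and the bins/range passed to da.histogram are placeholders you would choose for your own data):
import dask.array as da

# mean and standard deviation stay lazy until .compute()
mean = dask_df['amount'].mean().compute()
std = dask_df['amount'].std().compute()

# histogram on the underlying dask array; bins and range must be given
# up front because the data is never all in memory at once
counts, edges = da.histogram(dask_df['amount'].values, bins=100, range=(-10000, 10000))
counts = counts.compute()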
For comparison, here is the result using pandas. The data that I am using here fits in memory, but dask will work even when the data is larger than memory.
In [8]: import pandas as pd
In [9]: pandas_df = pd.read_csv('data/accounts.0.csv')
In [10]: len(pandas_df)
Out[10]: 1000000
In [11]: pandas_df.groupby('names').sum()
Out[11]:
id amount
names
Alice 10282524 43233084
Bob 8617531 47512276
Charlie 8056803 47729638
Dan 10146581 32513817
Edith 15164281 37806024
Frank 11310157 63869156
George 14941235 80436603
Hannah 3006336 25721574
Ingrid 10123877 54152865
Jerry 10317245 8613040
Kevin 6809100 16755317
Laura 9941112 34723055
Michael 11200937 36431387
Norbert 5715799 14482698
Oliver 10423117 32415534
Patricia 15289085 22767501
Quinn 10686459 16083432
Ray 10156429 9455663
Sarah 7977036 34970428
Tim 12283567 47851141
Ursula 4893696 37942347
Victor 8864468 15542688
Wendy 9348077 68824579
Xavier 6600945 -3482124
Yvonne 5665415 12701550
Zelda 8491817 42573021
Upvotes: 2
Reputation: 863741
The type of df isn't DataFrame but TextFileReader. I think you need to concatenate all the chunks into one DataFrame with concat and then apply your function:
df = pd.read_csv('verylargefile.csv', chunksize=10000) # gives TextFileReader
df_chunk = pd.concat(df, ignore_index=True)            # concatenate all chunks into one DataFrame
df_chunk['new_column'] = df_chunk['old_column'].apply(my_func)
# do other operations and filters...
df_chunk.to_csv('processed.csv', mode='a')
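If even the concatenated DataFrame does not fit into memory, an additive aggregation like sum can also be combined chunk by chunk; a rough sketch:
import pandas as pd

chunks = pd.read_csv('verylargefile.csv', chunksize=10000)
# sum within each chunk first, then sum the partial results
partials = [chunk.groupby('column_1').sum() for chunk in chunks]
result = pd.concat(partials).groupby(level=0).sum()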
EDIT:
Maybe this approach helps: process the large dataframe by its groups.
Example:
import pandas as pd
import numpy as np
import io
# test data
temp=u"""id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,16,13
3,14,15,13
3,14,185,213"""
df = pd.read_csv(io.StringIO(temp), sep=",", usecols=['id', 'col1'])
# drop duplicates; from the output you can choose the constants
df = df.drop_duplicates()
print(df)
#   id  col1
#0   1    13
#2   1    12
#3   2    18
#9   3    14
# for example, a list of constants
constants = [1,2,3]
# or the unique values of column id as a list
constants = df['id'].unique().tolist()
print(constants)
#[1, 2, 3]
for i in constants:
    iter_csv = pd.read_csv(io.StringIO(temp), delimiter=",", chunksize=10)
    # concat the subset of rows where id == constant
    df = pd.concat([chunk[chunk['id'] == i] for chunk in iter_csv])
    # your groupby function
    data = df.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()
    print(data.to_csv(index=False))
#id,col1,col2,col3
#1,12,15,13
#1,13,30,28
#
#id,col1,col2,col3
#2,18,90,78
#
#id,col1,col2,col3
#3,14,215,239
Upvotes: 1