Reputation: 2044
Is there a way to split up a pandas dataframe into multiple dataframes constrained by memory usage?
Upvotes: 1
Views: 1065
Reputation: 535
def split_dataframe(df, size):
    # average size of each row in bytes (index included)
    row_size = df.memory_usage().sum() / len(df)
    # maximum number of rows per segment; cast to int for slicing,
    # and keep at least one row per segment
    row_limit = max(int(size // row_size), 1)
    # number of segments, rounding up
    seg_num = (len(df) + row_limit - 1) // row_limit
    # split df into consecutive row slices
    segments = [df.iloc[i * row_limit : (i + 1) * row_limit] for i in range(seg_num)]
    return segments
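For example, a hypothetical call (the sample dataframe and the 8 KB budget below are made up for illustration):

import pandas as pd

df = pd.DataFrame({'a': range(1000), 'b': [1.5] * 1000})
# split into segments of at most ~8 KB each
for seg in split_dataframe(df, 8 * 1024):
    print(len(seg), seg.memory_usage().sum())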
Upvotes: 4
Reputation: 2553
This is easiest if the dataframe's columns all have fixed-width datatypes (i.e., not object). Here's an example of how you might go about it.
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1] * 100, 'b': [1.1, 2] * 50, 'c': range(100)})

# calculate the number of bytes a row occupies (index excluded)
row_bytes = df.dtypes.apply(lambda x: x.itemsize).sum()

mem_limit = 1024

# get the maximum number of rows in a segment
max_rows = mem_limit / row_bytes

# get the number of dataframes after splitting, rounding up
n_dfs = int(np.ceil(df.shape[0] / max_rows))

# get the index labels for each segment
df_segments = np.array_split(df.index, n_dfs)

# create a list of dataframes whose data is below mem_limit
split_dfs = [df.loc[seg, :] for seg in df_segments]
split_dfs
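As a quick sanity check (using the variables above), each segment's data should fit within the limit; the dtype-based estimate doesn't count the index, so the check excludes it as well:

for chunk in split_dfs:
    assert chunk.memory_usage(index=False).sum() <= mem_limit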
Also, if you can split by columns instead of rows, pandas has a handy memory_usage method.
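Here's a minimal sketch of such a column-wise split, assuming the df and mem_limit from above: greedily pack columns into groups whose combined byte size (index excluded) stays under the budget.

col_bytes = df.memory_usage(index=False)  # Series mapping column name -> bytes
groups, current, used = [], [], 0
for col, nbytes in col_bytes.items():
    # start a new group once adding this column would exceed the budget
    if current and used + nbytes > mem_limit:
        groups.append(current)
        current, used = [], 0
    current.append(col)
    used += nbytes
if current:
    groups.append(current)
col_dfs = [df[cols] for cols in groups]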
Upvotes: 0