Segmented

Reputation: 2044

Pandas dataframe split by memory usage

Is there a way to split up a pandas dataframe into multiple dataframes constrained by memory usage?

Upvotes: 1

Views: 1065

Answers (2)

Zhenhao Chen

Reputation: 535

def split_dataframe(df, size):

    # average number of bytes per row
    row_size = df.memory_usage().sum() / len(df)

    # maximum number of rows per segment (cast to int so it can be
    # used as a slice bound and in range())
    row_limit = int(size // row_size)

    # number of segments, via ceiling division
    seg_num = (len(df) + row_limit - 1) // row_limit

    # split df into segments of at most row_limit rows each
    segments = [df.iloc[i * row_limit : (i + 1) * row_limit] for i in range(seg_num)]

    return segments
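A quick sketch of how this might be used, repeating the function for a self-contained example (the 4 KiB budget and the `'x'` column are made up for illustration; note each segment re-counts its own index overhead, so segments can land slightly over the budget):

```python
import pandas as pd

def split_dataframe(df, size):
    # average number of bytes per row
    row_size = df.memory_usage().sum() / len(df)
    # maximum rows per segment, as an int so it works as a slice bound
    row_limit = int(size // row_size)
    # number of segments, via ceiling division
    seg_num = (len(df) + row_limit - 1) // row_limit
    return [df.iloc[i * row_limit : (i + 1) * row_limit] for i in range(seg_num)]

df = pd.DataFrame({'x': range(1000)})
segments = split_dataframe(df, size=4096)

# no rows lost, and the frame was actually split
print(sum(len(seg) for seg in segments) == len(df))
print(len(segments) > 1)
```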

Upvotes: 4

hume

Reputation: 2553

This is easiest when all of the dataframe's columns have fixed-width dtypes (i.e., no object columns), so each row occupies a known number of bytes. Here's an example of how you might go about it.

from __future__ import division  # must be the first statement in the module

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1] * 100, 'b': [1.1, 2] * 50, 'c': range(100)})

# calculate the number of bytes a row occupies
row_bytes = df.dtypes.apply(lambda x: x.itemsize).sum()

mem_limit = 1024

# get the maximum number of rows in a segment
max_rows = mem_limit / row_bytes

# get the number of dataframes after splitting
# (np.array_split needs an integer section count)
n_dfs = int(np.ceil(df.shape[0] / max_rows))

# get the indices of the dataframe segments
df_segments = np.array_split(df.index, n_dfs)

# create a list of dataframes that are below mem_limit
split_dfs = [df.loc[seg, :] for seg in df_segments]

split_dfs

Also, if you can split by columns instead of rows, pandas has a handy memory_usage method.
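For instance, `memory_usage` reports per-column byte counts, which you could feed into a greedy grouping of columns under a budget. A rough sketch (the 1024-byte limit and the greedy loop are illustrative, not a pandas built-in):

```python
import pandas as pd

df = pd.DataFrame({'a': [1] * 100, 'b': [1.1, 2] * 50, 'c': range(100)})

# per-column memory usage in bytes (index excluded and handled separately)
col_bytes = df.memory_usage(index=False)

# greedily pack columns into groups that each stay under the budget
mem_limit = 1024
groups, current, used = [], [], 0
for col, nbytes in col_bytes.items():
    if current and used + nbytes > mem_limit:
        groups.append(current)
        current, used = [], 0
    current.append(col)
    used += nbytes
if current:
    groups.append(current)

# one dataframe per column group, each below mem_limit (excluding the index)
col_dfs = [df[g] for g in groups]
```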

Upvotes: 0
