Having trouble splitting a dataframe into fixed chunks (per row)

Question

I've read several topics on this site about splitting a pandas dataframe into fixed-size chunks, but I'm having a problem that I didn't see addressed here. Here's the process: I ask the user how many chunks they want, then ask for the fraction of the dataframe to allocate to each chunk. I verify that the fractions given do not add up to more than 1, then split accordingly. The following is the last part, the one I'm struggling with:

def dataframe_splitting(df:pd.DataFrame, fracs:list):
    split_frac = []
    print('Size of the dataframe:', df.shape)
    print('fracs:', fracs)
    for i in fracs:
        x = int(i*len(df))
        split_frac.append(x)
    print('split_frac:', split_frac)
    chunks = np.array_split(df, split_frac)
    for x in chunks:
        print(x.shape)
    return chunks

And here is the output when the parameters were 5 chunks and fracs = [0.1, 0.1, 0.3, 0.2]:

Size of the dataframe: (2122905, 79)
fracs: [0.1, 0.1, 0.3, 0.2]
split_frac: [212290, 212290, 636871, 424581]
(212290, 79)
(0, 79)
(424581, 79)
(0, 79)
(1698324, 79)

As you can see, for the same parameter (0.1) I get one dataframe with 212290 rows, and the one after it is empty. I tried using np.split at first and the results were no different. I can't see what is wrong with this code or why it behaves this way.

Dmitry Lekhovitsky · Accepted Answer

According to the np.array_split documentation, the second argument, indices_or_sections, specifies chunk boundaries rather than chunk sizes when it is a list. That is, if we pass an array whose first axis has length N and a list split_frac with K elements, the resulting chunks cover the index ranges [0, split_frac[0]), [split_frac[0], split_frac[1]), ..., [split_frac[K-1], N). So if two consecutive elements of split_frac are equal, or an element is smaller than the one before it, the corresponding chunk has size 0. With your values [212290, 212290, 636871, 424581], that is why the second and fourth chunks are empty.
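
To make the boundary semantics concrete, here is a small sketch on a plain NumPy array of 10 elements instead of the dataframe. The sizes [1, 1, 3, 2] are just your fractions scaled down to N = 10, and the variable names are made up for the illustration.

```python
import numpy as np

data = np.arange(10)      # stand-in for the dataframe rows, N = 10
sizes = [1, 1, 3, 2]      # intended chunk sizes (0.1, 0.1, 0.3, 0.2 of 10)

# Passing the sizes directly: they are read as split positions,
# so the ranges are [0,1), [1,1), [1,3), [3,2), [2,10).
as_boundaries = [len(c) for c in np.array_split(data, sizes)]

# Passing the running total: the positions become 1, 2, 5, 7.
with_cumsum = [len(c) for c in np.array_split(data, np.cumsum(sizes))]

print(as_boundaries)  # [1, 0, 2, 0, 8]
print(with_cumsum)    # [1, 1, 3, 2, 3]
```

The first result has the same shape as your output (a chunk, an empty one, a chunk, an empty one, then a large tail), and the last entry of the second result is whatever is left over after the requested fractions.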

The minimal modification of your code to achieve the expected result is to call np.cumsum on the resulting split_frac variable:

import numpy as np
import pandas as pd

def dataframe_splitting(df: pd.DataFrame, fracs: list):
    split_frac = []
    print('Size of the dataframe:', df.shape)
    print('fracs:', fracs)
    for i in fracs:
        # number of rows in this chunk
        x = int(i*len(df))
        split_frac.append(x)
    # cumsum turns the chunk sizes into the boundaries array_split expects
    chunks = np.array_split(df, np.cumsum(split_frac))  # note the cumsum here
    for x in chunks:
        print(x.shape)
    return chunks
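
If you would rather see the row ranges before slicing anything, the same idea can be sketched without NumPy. The helper name chunk_bounds and its error check are my own additions, not part of your code; it takes the row count and the fractions and returns (start, stop) pairs, with the leftover rows as a final range.

```python
from itertools import accumulate

def chunk_bounds(n_rows, fracs):
    """Return (start, stop) row ranges: one per fraction, then the remainder."""
    sizes = [int(f * n_rows) for f in fracs]
    stops = list(accumulate(sizes))          # running total = chunk end positions
    if stops and stops[-1] > n_rows:
        raise ValueError("fractions add up to more than the whole dataframe")
    starts = [0] + stops
    return list(zip(starts, stops + [n_rows]))

# Same numbers as in the question: 2122905 rows, four fractions.
bounds = chunk_bounds(2122905, [0.1, 0.1, 0.3, 0.2])
chunk_sizes = [stop - start for start, stop in bounds]
print(chunk_sizes)
```

With a dataframe, each pair could then be applied as df.iloc[start:stop]. Note that the four fractions only cover 0.7 of the rows, so the fifth chunk is the remaining 0.3 rather than something you specified.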
