DonutsSauvage

Reputation: 187

Having trouble splitting a dataframe into fixed chunks (per row)

I've read several topics on this site about splitting a pandas DataFrame into fixed-size chunks, but I'm having a problem that I didn't see addressed here. Here's the process: I ask the user how many chunks they want, then ask what fraction of the DataFrame to allocate to each chunk, verify that the fractions do not sum to more than 1, and then split accordingly. The following is the last part, the one I'm struggling with:

def dataframe_splitting(df:pd.DataFrame, fracs:list):
    split_frac = []
    print('Size of the dataframe:', df.shape)
    print('fracs:', fracs)
    for i in fracs:
        x = int(i*len(df))
        split_frac.append(x)
    print('split_frac:', split_frac)
    chunks = np.array_split(df, split_frac)
    for x in chunks:
        print(x.shape)
    return chunks

Here is the result with 5 chunks and fracs = [0.1, 0.1, 0.3, 0.2]:

Size of the dataframe: (2122905, 79)
fracs: [0.1, 0.1, 0.3, 0.2]
split_frac: [212290, 212290, 636871, 424581]
(212290, 79)
(0, 79)
(424581, 79)
(0, 79)
(1698324, 79)

As you can see, for the same fraction (0.1), one chunk has 212290 rows and the one after it is empty. I tried np.split at first and the results were no different. I don't see what is wrong with this code or why it behaves this way.

Upvotes: 0

Views: 563

Answers (2)

Chris

Reputation: 16162

To split into data frames of different sizes, it's probably easier to use iloc and iterate over the row counts produced by your calculations. I did something similar to calculate the number of rows per frame, then used a loop and a counter to keep track of the start and stop row indices.

Here's a sample dataframe you can copy and read with pd.read_clipboard(). I print each resulting dataframe, but feel free to do whatever you like with them.

    a       b           c
1   43.91   -0.041619   43.91
2   43.39   0.011913    43.91
3   45.56   -0.048801   43.91
4   45.43   0.002857    43.91
5   45.33   0.002204    43.91
6   45.68   -0.007692   43.91
7   46.37   -0.014992   43.91
8   48.04   -0.035381   43.91
9   48.38   -0.007053   43.91


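import pandas as pd

df = pd.read_clipboard()  # reads the sample table copied above
fracs = [0.1, 0.1, 0.3, 0.2]

# Row count for each chunk, then walk through the frame with a running start index
start = 0
for size in [round(df.shape[0] * frac) for frac in fracs]:
    print(df.iloc[start:start + size])
    start += size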

Output

       a         b      c
1  43.91 -0.041619  43.91
       a         b      c
2  43.39  0.011913  43.91
       a         b      c
3  45.56 -0.048801  43.91
4  45.43  0.002857  43.91
5  45.33  0.002204  43.91
       a         b      c
6  45.68 -0.007692  43.91
7  46.37 -0.014992  43.91
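
If you want the chunks back as a list instead of printing them, the same loop can be wrapped in a small function. This is only a sketch: the name split_by_fractions is made up for illustration, and appending the leftover rows as a final chunk is an assumption about what you want (the loop above simply ignores them).

```python
import pandas as pd

def split_by_fractions(df, fracs):
    """Hypothetical helper: consecutive iloc slices sized by fracs, plus the remainder."""
    sizes = [round(len(df) * frac) for frac in fracs]
    chunks, start = [], 0
    for size in sizes:
        chunks.append(df.iloc[start:start + size])
        start += size
    chunks.append(df.iloc[start:])  # assumption: keep leftover rows as a last chunk
    return chunks

df = pd.DataFrame({"a": range(9)})  # stand-in for the 9-row sample above
chunks = split_by_fractions(df, [0.1, 0.1, 0.3, 0.2])
print([len(c) for c in chunks])  # [1, 1, 3, 2, 2]
```

With four fractions this returns five chunks, the last one holding whatever rows the fractions did not cover.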

Upvotes: 1

Dmitry Lekhovitsky

Reputation: 26

According to the np.array_split documentation, when the second argument indices_or_sections is a list, it specifies chunk boundaries rather than chunk sizes. That is, if we pass an array whose first axis has length N and a list idx with K elements, the resulting chunks correspond to the index ranges [0, idx[0]), [idx[0], idx[1]), ..., [idx[K-1], N). So if two consecutive elements of idx are equal, or an element is smaller than the one before it, the result is a chunk of size 0. That is what happens in your output: the repeated 212290 and the drop from 636871 to 424581 each produce an empty chunk.
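
To see the boundary behaviour in isolation, here is a small sketch on a plain 10-element NumPy array (the array and the variable names are just for illustration). The list mirrors your fractions scaled to 10 rows.

```python
import numpy as np

data = np.arange(10)

# A list argument means "cut at these indices", not "make chunks of these sizes"
intended_sizes = [1, 1, 3, 2]
chunks = np.array_split(data, intended_sizes)

# Slices taken: [0:1], [1:1], [1:3], [3:2], [2:10]
lengths = [len(c) for c in chunks]
print(lengths)  # [1, 0, 2, 0, 8]
```

The pattern is the same as in your output: a real chunk, an empty one, another real chunk, another empty one, and then everything from the last index to the end.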

The minimal modification of your code to achieve the expected result is to call np.cumsum on the resulting split_frac variable:

def dataframe_splitting(df:pd.DataFrame, fracs:list):
    split_frac = []
    print('Size of the dataframe:', df.shape)
    print('fracs:', fracs)
    for i in fracs:
        x = int(i*len(df))
        split_frac.append(x)
    chunks = np.array_split(df, np.cumsum(split_frac))  # note the cumsum here
    for x in chunks:
        print(x.shape)
    return chunks
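
As a quick check of the cumulative boundaries, here is a sketch on a 20-row toy frame (the frame and names are made up for illustration). It slices with iloc rather than np.array_split, so it does not depend on how a particular NumPy/pandas combination handles DataFrames, but the boundaries are the same ones np.cumsum produces above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(20)})  # toy data, not the original frame
fracs = [0.1, 0.1, 0.3, 0.2]

sizes = [int(frac * len(df)) for frac in fracs]  # [2, 2, 6, 4]
bounds = [0, *np.cumsum(sizes), len(df)]         # 0, 2, 4, 10, 14, 20

# One chunk per pair of neighbouring boundaries
chunks = [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]
row_counts = [len(c) for c in chunks]
print(row_counts)  # [2, 2, 6, 4, 6]
```

Note the fifth chunk: four fractions give four boundaries, and everything after the last boundary becomes one more chunk holding the remaining rows.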

Upvotes: 1
