Reputation: 187
I've read several topics on this site about splitting a pandas dataframe into fixed-size chunks, but I'm having a problem that I didn't see addressed here. Here's the process: I ask the user how many chunks they want, then ask what fraction of the dataframe to allocate to each chunk. I verify that the fractions given do not add up to more than 1, then split accordingly. The following is the last part, the one I'm struggling with:
def dataframe_splitting(df:pd.DataFrame, fracs:list):
    split_frac = []
    print('Size of the dataframe:', df.shape)
    print('fracs:', fracs)
    for i in fracs:
        x = int(i*len(df))
        split_frac.append(x)
    print('split_frac:', split_frac)
    chunks = np.array_split(df, split_frac)
    for x in chunks:
        print(x.shape)
    return chunks
And here is the output when the parameters were 5 chunks and fracs = [0.1, 0.1, 0.3, 0.2]:
Size of the dataframe: (2122905, 79)
fracs: [0.1, 0.1, 0.3, 0.2]
split_fracs: [212290, 212290, 636871, 424581]
(212290, 79)
(0, 79)
(424581, 79)
(0, 79)
(1698324, 79)
As you can see, for the same fraction (0.1), one dataframe has 212290 rows and the one after it is empty. I tried np.split at first and the results were no different. I don't understand what is wrong with this code or why it behaves this way.
Upvotes: 0
Views: 563
Reputation: 16162
To split into data frames of different sizes, it's probably easier to use iloc and iterate over the row counts generated by your calculations. I did something similar to calculate the number of rows per frame, then used a loop and a counter to keep track of the start and stop row indices.
Here's a sample dataframe you can copy and read with pd.read_clipboard().
I printed each resulting dataframe, but feel free to do whatever you like with them.
       a         b      c
1  43.91 -0.041619  43.91
2  43.39  0.011913  43.91
3  45.56 -0.048801  43.91
4  45.43  0.002857  43.91
5  45.33  0.002204  43.91
6  45.68 -0.007692  43.91
7  46.37 -0.014992  43.91
8  48.04 -0.035381  43.91
9  48.38 -0.007053  43.91
# df is the sample dataframe above, e.g. df = pd.read_clipboard()
fracs = [0.1, 0.1, 0.3, 0.2]
start = 0
for n_rows in [round(df.shape[0]*frac) for frac in fracs]:
    print(df.iloc[start:start+n_rows])
    start += n_rows
Output:
       a         b      c
1  43.91 -0.041619  43.91
       a         b      c
2  43.39  0.011913  43.91
       a         b      c
3  45.56 -0.048801  43.91
4  45.43  0.002857  43.91
5  45.33  0.002204  43.91
       a         b      c
6  45.68 -0.007692  43.91
7  46.37 -0.014992  43.91
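If you would rather collect the chunks than print them, the same loop can be wrapped in a function. This is only a sketch: the name split_by_fracs and the keep_remainder flag are made up for illustration, and it assumes you want the leftover rows (here the fractions only add up to 0.7) returned as a final chunk.

```python
import pandas as pd


def split_by_fracs(df, fracs, keep_remainder=True):
    """Hypothetical helper: split df row-wise into consecutive chunks sized by fracs."""
    # Convert each fraction to a row count, as in the loop above.
    sizes = [round(len(df) * frac) for frac in fracs]
    chunks = []
    start = 0
    for size in sizes:
        # iloc slicing clips at the end of the frame, so an overshoot
        # caused by rounding yields a shorter chunk rather than an error.
        chunks.append(df.iloc[start:start + size])
        start += size
    # Assumption: rows not covered by fracs go into one extra chunk.
    if keep_remainder and start < len(df):
        chunks.append(df.iloc[start:])
    return chunks


# Small stand-in frame with 9 rows, like the sample above.
df = pd.DataFrame({"a": range(9)})
chunks = split_by_fracs(df, [0.1, 0.1, 0.3, 0.2])
print([len(c) for c in chunks])
```

Because the row counts are rounded, they may not add up exactly to the number of rows covered by the fractions, so it is worth checking the chunk lengths on your real data.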
Upvotes: 1
Reputation: 26
According to the np.array_split documentation, the second argument, indices_or_sections, specifies chunk boundaries rather than chunk sizes. That is, if we pass an array with a first axis of length N and a list fracs with K elements, the resulting chunks will correspond to the index ranges [0, fracs[0]), [fracs[0], fracs[1]), ..., [fracs[K-1], N). So, if two consecutive elements of fracs are equal, this will result in a chunk of size 0.
The minimal modification of your code to achieve the expected result is to call np.cumsum on the resulting split_frac variable:
def dataframe_splitting(df:pd.DataFrame, fracs:list):
    split_frac = []
    print('Size of the dataframe:', df.shape)
    print('fracs:', fracs)
    for i in fracs:
        x = int(i*len(df))
        split_frac.append(x)
    chunks = np.array_split(df, np.cumsum(split_frac)) # note the cumsum here
    for x in chunks:
        print(x.shape)
    return chunks
Upvotes: 1