Reputation: 333
I have a dataframe that looks like this:
index key set_col data
0 "a1" ("a", "b") "a1_data"
1 "a2" ("j", "k", "l", "m") "a2_data"
2 "b1" ("z", "y", "x", "w", "v", "u", "t") "b1_data"
I need to split set_col whenever it has more than 3 elements, putting each extra chunk in a duplicated row with the same key and data, resulting in this df:
index key set_col data
0 "a1" ("a", "b") "a1_data"
1 "a2" ("j", "k", "l") "a2_data"
2 "a2" ("m") "a2_data"
3 "b1" ("z", "y", "x") "b1_data"
4 "b1" ("w", "v", "u") "b1_data"
5 "b1" ("t") "b1_data"
I have read other answers using explode, replace or assign, like this or this, but none of them handles splitting lists or sets to a fixed length and duplicating the rows.
In this answer I found the following code:
def split(a, n):
    k, m = divmod(len(a), n)
    return (a[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n))
I tried to apply it to the column like this:
df['split_set_col'] = df['set_col'].apply(split(df['set_col'], 3))
But I get this error:
pandas.errors.SpecificationError: nested renamer is not supported
Upvotes: 2
Views: 54
Reputation: 8768
Here is an option, assuming the values in your set_col column are tuples:
(df[['key','data']].join(
    df['set_col'].explode()
    .to_frame()
    .assign(cc = lambda x: x.groupby(level=0).cumcount().floordiv(3))
    .set_index('cc',append=True)
    .groupby(level=[0,1])['set_col']
    .agg(tuple)
    .droplevel(1)))
Output:
  key data set_col
0 a1 a1_data (a, b)
1 a2 a2_data (j, k, l)
1 a2 a2_data (m,)
2 b1 b1_data (z, y, x)
2 b1 b1_data (w, v, u)
2 b1 b1_data (t,)
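To see what the cumcount().floordiv(3) step is doing, here is a small sketch on a toy Series. The names exploded and chunk_id are only for illustration and are not part of the code above:

```python
import pandas as pd

# Toy data, assuming tuples as in the question
s = pd.Series([("a", "b"), ("j", "k", "l", "m")], name="set_col")

# One row per element; the original row label is repeated in the index
exploded = s.explode()

# Position of each element within its original row, integer-divided by 3,
# gives the chunk number that element belongs to
chunk_id = exploded.groupby(level=0).cumcount().floordiv(3)

print(chunk_id.tolist())  # [0, 0, 0, 0, 0, 1]
```

Only "m" (the fourth element of the second row) gets chunk number 1, which is why it ends up in its own tuple after grouping on the original index plus the chunk number.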
Upvotes: 0
Reputation: 120469
Your function call is not right:
df['set_col'].apply(split(df['set_col'], 3))
Replace with:
df['set_col'].apply(split, n=3) # note the n=3 as named argument
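A quick illustration of the difference: apply wants the function object itself and forwards extra arguments to it for each element. In this sketch, take is a hypothetical helper used only to show the argument passing:

```python
import pandas as pd

def take(seq, n):
    """Hypothetical helper: return the first n items of seq."""
    return seq[:n]

s = pd.Series([("a", "b"), ("j", "k", "l", "m")])

# Pass the function itself; extra arguments are forwarded per element,
# either as keywords or positionally through args=
by_keyword = s.apply(take, n=3)
by_position = s.apply(take, args=(3,))

print(by_keyword.tolist())  # [('a', 'b'), ('j', 'k', 'l')]
```

Writing s.apply(take(s, 3)) instead would call take once on the whole Series and hand its return value to apply, which is not what you want.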
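The function also does not do what you want: it splits the sequence into n parts of roughly equal size, not into chunks of at most n elements. Use np.array_split with explicit split indices instead: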
import numpy as np

def split(a, n):
    return np.array_split(a, np.arange(0, len(a), n)[1:])

df['split_set_col'] = df['set_col'].apply(split, n=3)
Output:
>>> df.explode('split_set_col', ignore_index=True)
key set_col data split_set_col
0 "a1" (a, b) "a1_data" [a, b]
1 "a2" (j, k, l, m) "a2_data" [j, k, l]
2 "a2" (j, k, l, m) "a2_data" [m]
3 "b1" (z, y, x, w, v, u, t) "b1_data" [z, y, x]
4 "b1" (z, y, x, w, v, u, t) "b1_data" [w, v, u]
5 "b1" (z, y, x, w, v, u, t) "b1_data" [t]
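If you would rather keep tuples (instead of NumPy arrays) and overwrite set_col directly, plain slicing should also work. This is a minimal sketch assuming set_col holds tuples or lists (real Python sets are unordered and cannot be sliced); chunk is a made-up helper name:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a1", "a2", "b1"],
    "set_col": [("a", "b"),
                ("j", "k", "l", "m"),
                ("z", "y", "x", "w", "v", "u", "t")],
    "data": ["a1_data", "a2_data", "b1_data"],
})

def chunk(seq, n):
    """Hypothetical helper: consecutive slices of seq, each at most n long."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

out = (
    df.assign(set_col=df["set_col"].apply(chunk, n=3))  # list of tuples per row
      .explode("set_col", ignore_index=True)            # one row per tuple
)

print(out)
```

explode only unpacks one level, so each inner tuple stays intact and the key and data values are repeated for every chunk.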
Upvotes: 2