Reputation: 407
Below is the input data
A B C D
R1 J1 D1 S1,S2
R1 J1 D1 S3,S4,S5
R1 J1 D2 S5,S6,S2
R1 J1 D2 S7,S8
P1 J2 E1 T1,T2
P1 J2 E1 T3,T4,T5
P1 J2 E2 T5,T6,T2
P1 J2 E3 T7,T8,T5
Expected output, with no repeated values in column D:
A B C D
R1 J1 D1,D2 S1,S2,S3,S4,S5,S6,S7,S8
P1 J2 E1,E2,E3 T1,T2,T3,T4,T5,T6,T7,T8
The script I have tried, which is not working:
df.groupby(['A','B'])[['C','D']].agg([','.join]).reset_index()
Upvotes: 2
Views: 80
Reputation: 862741
Use a lambda function that removes duplicates by converting the split values to dictionary keys with dict.fromkeys in GroupBy.agg:
f = lambda x: ','.join(dict.fromkeys(z for y in x for z in y.split(',')))
df = df.groupby(['A','B'], sort=False)[['C','D']].agg(f).reset_index()
print (df)
A B C D
0 R1 J1 D1,D2 S1,S2,S3,S4,S5,S6,S7,S8
1 P1 J2 E1,E2,E3 T1,T2,T3,T4,T5,T6,T7,T8
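The dedup trick can be seen in isolation on one group's column D values; dict.fromkeys keeps only the first occurrence of each key and preserves insertion order (guaranteed since Python 3.7), so joining its keys removes duplicates without reordering:

```python
# One group's comma-separated strings (from the sample data above)
values = ['S1,S2', 'S3,S4,S5', 'S5,S6,S2']

# flatten all strings into single values, then dedupe via dict keys
deduped = ','.join(dict.fromkeys(z for y in values for z in y.split(',')))
print(deduped)  # S1,S2,S3,S4,S5,S6
```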
Performance depends on the length of the DataFrame, the number of groups, and the length of the split values, so it is best to test on real data:
import random
import string

import numpy as np
import pandas as pd

np.random.seed(785)
# sample 1k rows
N = 1000
a = [','.join(np.random.choice(list(string.ascii_letters[:random.randint(3, 10)]), random.randint(3, 10)))
     for _ in range(N)]
df = pd.DataFrame({'A':np.random.randint(20, size=N),
                   'B':np.random.randint(10, size=N),
                   'C':a,
                   'D':sorted(a)})
df = df.sort_values(['A','B'])
In [36]: %timeit df.groupby(['A','B'])[['C','D']].agg(lambda x: ','.join(x.str.split(',').explode().unique())).reset_index()
154 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [37]: %timeit df.groupby(['A','B'], sort=False)[['C','D']].agg(lambda x: ','.join(dict.fromkeys(z for y in x for z in y.split(',')))).reset_index()
11.4 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another sample:
np.random.seed(785)
# sample 5k rows
N = 5000
a = [','.join(np.random.choice(list(string.ascii_letters[:random.randint(3, 10)]), random.randint(3, 10)))
     for _ in range(N)]
df = pd.DataFrame({'A':np.random.randint(200, size=N),
                   'B':np.random.randint(10, size=N),
                   'C':a,
                   'D':sorted(a)})
df = df.sort_values(['A','B'])
print (df.head(10))
In [43]: %timeit df.groupby(['A','B'])[['C','D']].agg(lambda x: ','.join(x.str.split(',').explode().unique())).reset_index()
1.47 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %timeit df.groupby(['A','B'], sort=False)[['C','D']].agg(lambda x: ','.join(dict.fromkeys(z for y in x for z in y.split(',')))).reset_index()
56.9 ms ± 518 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 1
Reputation: 71580
Try using str.split to split on the commas, then explode to flatten the Series, and unique to remove the duplicates:
df.groupby(['A','B'])[['C','D']].agg(lambda x: ','.join(x.str.split(',').explode().unique())).reset_index()
Output:
A B C D
0 P1 J2 E1,E2,E3 T1,T2,T3,T4,T5,T6,T7,T8
1 R1 J1 D1,D2 S1,S2,S3,S4,S5,S6,S7,S8
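The split/explode/unique chain can be traced on a toy Series mimicking one group's column D values; Series.unique returns the distinct values in order of first appearance, so the result matches the dict.fromkeys approach:

```python
import pandas as pd

# One group's comma-separated strings (from the sample data above)
s = pd.Series(['S1,S2', 'S3,S4,S5', 'S5,S6,S2'])

# str.split -> lists, explode -> one value per row, unique -> dedupe in order
result = ','.join(s.str.split(',').explode().unique())
print(result)  # S1,S2,S3,S4,S5,S6
```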
Upvotes: 2