AB14

Reputation: 407

Python Groupby on String values

Below is the input data

A      B       C       D
R1     J1      D1     S1,S2
R1     J1      D1     S3,S4,S5 
R1     J1      D2     S5,S6,S2
R1     J1      D2     S7,S8 
P1     J2      E1     T1,T2 
P1     J2      E1     T3,T4,T5
P1     J2      E2     T5,T6,T2
P1     J2      E3     T7,T8,T5

Expected output, with no repeated values in column D:

A     B        C             D
R1    J1      D1,D2         S1,S2,S3,S4,S5,S6,S7,S8
P1    J2      E1,E2,E3      T1,T2,T3,T4,T5,T6,T7,T8

The script I have tried, which is not working:

df.groupby(['A','B'])[['C','D']].agg([','.join]).reset_index()
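For reference, joining with `','.join` alone concatenates the group's strings as-is, so repeated items such as S2 and S5 survive. A minimal sketch of that behavior on a subset of the sample data (using `agg(','.join)` directly rather than the list form, to keep plain column names):

```python
import pandas as pd

df = pd.DataFrame({'A': ['R1', 'R1'], 'B': ['J1', 'J1'],
                   'C': ['D1', 'D2'], 'D': ['S1,S2', 'S5,S6,S2']})

# Plain join concatenates the strings without removing duplicates,
# so the repeated S2 remains in the result
out = df.groupby(['A', 'B'])[['C', 'D']].agg(','.join).reset_index()
print(out['D'].iloc[0])  # S1,S2,S5,S6,S2
```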

Upvotes: 2

Views: 80

Answers (2)

jezrael

Reputation: 862741

Use a lambda function in GroupBy.agg that removes duplicates by passing the split values to dict.fromkeys:

f = lambda x: ','.join(dict.fromkeys(z for y in x for z in y.split(',')))
df = df.groupby(['A','B'], sort=False)[['C','D']].agg(f).reset_index()
print (df)
    A   B         C                        D
0  R1  J1     D1,D2  S1,S2,S3,S4,S5,S6,S7,S8
1  P1  J2  E1,E2,E3  T1,T2,T3,T4,T5,T6,T7,T8
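The dict.fromkeys trick works because dictionaries keep insertion order (Python 3.7+), so each value's first occurrence is retained and later duplicates are dropped, a minimal illustration:

```python
# dict.fromkeys keeps only the first occurrence of each key,
# preserving the order in which the values were seen
vals = ['S1', 'S2', 'S3', 'S2', 'S5', 'S1']
deduped = ','.join(dict.fromkeys(vals))
print(deduped)  # S1,S2,S3,S5
```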

Performance depends on the length of the DataFrame, the number of groups, and the length of the split values, so it is best to test on real data:

import random
import string

import numpy as np
import pandas as pd

np.random.seed(785)

#sample 1k rows
N = 1000

a = [','.join(np.random.choice(list(string.ascii_letters[:random.randint(3, 10)]), random.randint(3, 10))) 
     for _ in range(N)]


df = pd.DataFrame({'A':np.random.randint(20, size=N),
                   'B':np.random.randint(10, size=N),
                   'C':a,
                   'D':sorted(a)})

df = df.sort_values(['A','B'])


In [36]: %timeit df.groupby(['A','B'])[['C','D']].agg(lambda x: ','.join(x.str.split(',').explode().unique())).reset_index()
154 ms ± 1.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [37]: %timeit df.groupby(['A','B'], sort=False)[['C','D']].agg(lambda x: ','.join(dict.fromkeys(z for y in x for z in y.split(',')))).reset_index()
11.4 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Another sample:

np.random.seed(785)

#5k rows
N = 5000

a = [','.join(np.random.choice(list(string.ascii_letters[:random.randint(3, 10)]), random.randint(3, 10))) 
     for _ in range(N)]


df = pd.DataFrame({'A':np.random.randint(200, size=N),
                   'B':np.random.randint(10, size=N),
                   'C':a,
                   'D':sorted(a)})

df = df.sort_values(['A','B'])
print (df.head(10))


In [43]: %timeit df.groupby(['A','B'])[['C','D']].agg(lambda x: ','.join(x.str.split(',').explode().unique())).reset_index()
1.47 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [44]: %timeit df.groupby(['A','B'], sort=False)[['C','D']].agg(lambda x: ','.join(dict.fromkeys(z for y in x for z in y.split(',')))).reset_index()
56.9 ms ± 518 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

U13-Forward

Reputation: 71580

Try using str.split to split on the commas, then use explode to flatten the Series and use unique to remove the duplicates:

df.groupby(['A','B'])[['C','D']].agg(lambda x: ','.join(x.str.split(',').explode().unique())).reset_index()

Output:

    A   B         C                        D
0  P1  J2  E1,E2,E3  T1,T2,T3,T4,T5,T6,T7,T8
1  R1  J1     D1,D2  S1,S2,S3,S4,S5,S6,S7,S8
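As a self-contained sketch, rebuilding the question's sample data and applying this split/explode/unique chain (assuming pandas is available):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['R1', 'R1', 'R1', 'R1', 'P1', 'P1', 'P1', 'P1'],
    'B': ['J1', 'J1', 'J1', 'J1', 'J2', 'J2', 'J2', 'J2'],
    'C': ['D1', 'D1', 'D2', 'D2', 'E1', 'E1', 'E2', 'E3'],
    'D': ['S1,S2', 'S3,S4,S5', 'S5,S6,S2', 'S7,S8',
          'T1,T2', 'T3,T4,T5', 'T5,T6,T2', 'T7,T8,T5'],
})

# str.split turns each cell into a list, explode flattens the lists
# into one Series per group, and unique drops the duplicates
out = (df.groupby(['A', 'B'])[['C', 'D']]
         .agg(lambda x: ','.join(x.str.split(',').explode().unique()))
         .reset_index())
print(out)
```

Note that without sort=False the groups come back sorted by key, which is why P1/J2 appears before R1/J1 here.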

Upvotes: 2
