Anonymous
Anonymous

Reputation: 113

Join strings in array column group by another array column in pandas dataframe

My dataframe contains three array columns having same length

| col_1                   | col_2                   | col_3                        |
|-------------------------|-------------------------|------------------------------|
|['a', 'b', 'c', 'd', 'e']|['n', 'n', 'n', 'd', 'd']|['cl', 'cl', 'cl', 'en', 'en']|
|['d', 'e', 'f', 'g', 'h']|['d', 'd', 'd', 'n', 'n']|['de', 'de', 'de', 'me', 'me']|
|['a', 'd', 'e', 'b', 'c']|['p', 'p', 'c', 'c', 'c']|['it', 'it', 'cs', 'cs', 'cs']|

I want to join the strings in col_1 using group by in col_3. The corresponding string should be in col_2. My output dataframe should like this.

| col_1       | col_2    | col_3      |
|-------------|----------|------------|
|['abc', 'de']|['n', 'd']|['cl', 'en']|
|['def', 'gh']|['d', 'n']|['de', 'me']|
|['ad', 'ebc']|['p', 'c']|['it', 'cs']|

Final dataframe columns should have same length

Upvotes: 1

Views: 80

Answers (1)

mozway
mozway

Reputation: 261860

You can explode and use a double groupby.agg:

out = (df
   .explode(list(df)) # explode all columns, use a specific list if needed
   .reset_index()
   .groupby(['index', 'col_3'])
   .agg(''.join)
   .reset_index('col_3')
   .groupby(level=0).agg(list)
   [df.columns]
)

output:

           col_1      col_2     col_3
index                                
0      [abc, de]  [nnn, dd]  [cl, en]
1      [def, gh]  [ddd, nn]  [de, me]
2      [ebc, ad]  [ccc, pp]  [cs, it]

Or maybe you want to group by both col_2/col_3:

out = (df
   .explode(list(df))
   .reset_index()
   .groupby(['index', 'col_2', 'col_3'])
   .agg(''.join)
   .reset_index(['col_2', 'col_3'])
   .groupby(level=0).agg(list)
   [df.columns]
)

output:

           col_1   col_2     col_3
index                             
0      [de, abc]  [d, n]  [en, cl]
1      [def, gh]  [d, n]  [de, me]
2      [ebc, ad]  [c, p]  [cs, it]

Used input:

df = pd.DataFrame({'col_1': [['a', 'b', 'c', 'd', 'e'],
                             ['d', 'e', 'f', 'g', 'h'], 
                             ['a', 'd', 'e', 'b', 'c']],
                   'col_2': [['n', 'n', 'n', 'd', 'd'],
                             ['d', 'd', 'd', 'n', 'n'], 
                             ['p', 'p', 'c', 'c', 'c']],
                   'col_3': [['cl', 'cl', 'cl', 'en', 'en'], 
                             ['de', 'de', 'de', 'me', 'me'],
                             ['it', 'it', 'cs', 'cs', 'cs']]})

Upvotes: 1

Related Questions