Nick
Nick

Reputation: 387

Specific Group By in Pandas

I have a pandas dataframe in this format:

 id    Role  time  text        
 id1   A     t1   text1        
 id1   B     t2   text2        
 id1   B     t3   text3        
 id1   A     t4   text4        
 id1   B     t5   text5        
 id1   A     t6   text6        
 id1   A     t7   text7        
 id2   A     t8   text8        
 id2   B     t9   text9        
 id2   B     t10  text1        
 id2   B     t11  text10       
 id2   A     t12  text11       
 id2   A     t13  text12   

I would like to form a dataframe or even a text file like this:

 id    Role  text                     
 id1   A     text1                    
 id1   B     text2, text3             
 id1   A     text4                    
 id1   B     text5                    
 id1   A     text6, text7             
 id2   A     text8                    
 id2   B     text9, text10, text11    
 id2   A     text12, text13           
 id2   B     text11                   
 id2   A     text12, text13           

Or in the text format:

text1
text2, text3
text4
text5
text6, text7
==NEXT ID==
text8
text9, text10, text11
text12, text13
text11
text12, text13

I've used group by in pandas but I can't get it done since group by Role will put al texts together. Thanks.

Upvotes: 2

Views: 148

Answers (1)

piRSquared
piRSquared

Reputation: 294278

check if 'Role' is equal to its prior value and do a cumulative sum to simulate a third grouping.

g3 = (df.Role != df.Role.shift()).cumsum().rename('clump')
df.groupby(['id', 'Role', g3], sort=False).text.apply(', '.join).reset_index()

    id Role  clump                  text
0  id1    A      1                 text1
1  id1    B      2          text2, text3
2  id1    A      3                 text4
3  id1    B      4                 text5
4  id1    A      5          text6, text7
5  id2    A      5                 text8
6  id2    B      6  text9, text1, text10
7  id2    A      7        text11, text12

To dump this to text files

g3 = (df.Role != df.Role.shift()).cumsum().rename('clump')
d1 = df.groupby(['id', 'Role', g3], sort=False).text.apply(', '.join).reset_index()

for n, g in d1.groupby('id').text:
    print(g)
    print()
    # g.to_csv('{}.csv'.format(n))

0           text1
1    text2, text3
2           text4
3           text5
4    text6, text7
Name: text, dtype: object

5                   text8
6    text9, text1, text10
7          text11, text12
Name: text, dtype: object

Upvotes: 3

Related Questions