S.Zhong

Reputation: 101

Pandas groupby concat ungrouped column into comma separated string

I have the following example df:

        col1     col2    col3   doc_no
     0     a        x       f        0
     1     a        x       f        1
     2     b        x       g        2
     3     b        y       g        3
     4     c        x       t        3
     5     c        y       t        4
     6     a        x       f        5
     7     d        x       t        5
     8     d        x       t        6

I want to group by the first 3 columns (col1, col2, col3), concatenate the fourth column (doc_no) into a comma-separated string for each group, and add a count column (count) with the number of rows in each group, sorted in descending order. Example desired output below (column order doesn't matter):

        col1     col2    col3   count      doc_no
     0     a        x       f       3     0, 1, 5
     1     d        x       t       2        5, 6
     2     b        x       g       1           2
     3     b        y       g       1           3
     4     c        x       t       1           3
     5     c        y       t       1           4

How would I go about doing this? I used the below line to get just the grouping and the count:

grouped_df = df.groupby(['col1','col2','col3']).size().reset_index(name='count')\
    .sort_values(['count'], ascending=False).reset_index()

But I'm not sure how to also get the concatenated doc_no column in the same code line.

Upvotes: 7

Views: 2083

Answers (3)

wwnde

Reputation: 26676

Another way

df2=df.groupby(['col1','col2','col3']).agg(doc_no=('doc_no',list)).reset_index()
df2['doc_no']=df2['doc_no'].astype(str).str[1:-1]

Upvotes: 1

BENY

Reputation: 323316

Convert doc_no to strings first, then aggregate with both count and ','.join:

df.doc_no=df.doc_no.astype(str)
s=df.groupby(['col1','col2','col3']).doc_no.agg(['count',','.join]).reset_index()
s
  col1 col2 col3  count   join
0    a    x    f      3  0,1,5
1    b    x    g      1      2
2    b    y    g      1      3
3    c    x    t      1      3
4    c    y    t      1      4
5    d    x    t      2    5,6

Upvotes: 4

cs95

Reputation: 402814

Try groupby and agg like so:

(df.groupby(['col1', 'col2', 'col3'])['doc_no']
   .agg(['count', ('doc_no',  lambda x: ','.join(map(str, x)))])
   .sort_values('count', ascending=False)     
   .reset_index())

  col1 col2 col3  count doc_no
0    a    x    f      3  0,1,5
1    d    x    t      2    5,6
2    b    x    g      1      2
3    b    y    g      1      3
4    c    x    t      1      3
5    c    y    t      1      4

agg works well here because it accepts a list of reducers to run on a single column, and a (name, function) tuple in that list sets the name of the resulting column.
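For reference, here is a self-contained sketch of the same idea using named aggregation (the `name=(column, function)` keyword form, available in pandas 0.25 and later as far as I know). The DataFrame is rebuilt from the question's example; the variable name `result` and the `", "` separator are my own choices to match the desired output, not anything from the answers above.

```python
import pandas as pd

# Example data reconstructed from the question.
df = pd.DataFrame({
    "col1": list("aabbccadd"),
    "col2": list("xxxyxyxxx"),
    "col3": list("ffggttftt"),
    "doc_no": [0, 1, 2, 3, 3, 4, 5, 5, 6],
})

result = (
    df.groupby(["col1", "col2", "col3"])
      .agg(
          # Number of rows in each group.
          count=("doc_no", "size"),
          # Cast each value to str so join works on integer columns.
          doc_no=("doc_no", lambda s: ", ".join(map(str, s))),
      )
      .sort_values("count", ascending=False)
      .reset_index()
)

print(result)
```

The keyword names become the output column names, so no renaming step is needed afterwards. Groups that share the same count may come out in a different relative order than in the question's example, which the question says is not a concern for columns and presumably not for tied rows either.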

Upvotes: 10
