NewGuy
NewGuy

Reputation: 3413

How can I keep all the data fields when using groupby in a pandas dataframe?

I'm gathering some statistics from about contributions to my organization's repositories. The final dataframe looks like this:

repository          developer  commits   adds  deletes first_commit_week last_commit_week
     repo1               dev1        1     57       12        2021-01-10       2021-01-10
     repo1               dev2        5     71       58        2020-08-23       2020-09-27
     repo1               dev3       10    107       16        2020-08-09       2020-10-18
     repo2               dev1       25    712     1184        2020-06-28       2020-12-13
     repo2               dev3       51   1141      236        2021-01-10       2021-05-09
     repo2               dev4        4      9        7        2021-01-10       2021-05-09
     repo2               dev5       10    121      242        2020-08-09       2021-02-07
     repo2               dev6       75   4319     8881        2020-07-19       2021-04-25
     repo3               dev1       21   3008     6391        2019-09-08       2021-05-09
     repo3               dev3        5     51       70        2019-09-08       2021-04-25
     repo3               dev6        1     14        1        2020-06-28       2020-06-28

After that, I'm grouping the dataframe and printing it like so:

df_groupby_repository_developer = stats_df.groupby(["repository", "developer"])
print(df_groupby_repository_developer.sum().to_string())

Which results in this output

                               commits    adds  deletes
repository developer                                   
repo1           dev1                 1      57       12
                dev2                 5      71       58
                dev3                10     107       16
repo2           dev1                25     712     1184
                dev3                51    1141      236
                dev4                 4       9        7
                dev5                10     121      242
                dev6                75    4319     8881
repo3           dev1                21    3008     6391 
                dev3                 5      51       70
                dev6                 1      14        1

With this I have a nice visual output. It shows the groups very nicely without the extra repository names for everything. However, it loses the two date fields.

I realize this is due to sum(), but I'd like the nice visual output that I have with the groupby but also have the first and last commit dates that are in the original dataframe. How can I accomplish this?

Is there a better way to accomplish this? groupby feels like the wrong way to go about this, now that I've played with it to get this far.

Upvotes: 0

Views: 520

Answers (2)

joao
joao

Reputation: 2293

Your data never has more than one (repo, dev) couple, so I'm wondering whether you really want to do a sum, or if you're just using it because groupby requires an aggregation function. If you don't really need to sum, then you can use the identity function as the aggregation function, this works for all your columns, including dates:

In [14]: stats_df.groupby(['repository', 'developer']).agg(lambda x: x)
Out[14]: 
                      commits  adds  deletes first_commit_week last_commit_week
repository developer                                                           
repo1      dev1             1    57       12        2021-01-10       2021-01-10
           dev2             5    71       58        2020-08-23       2020-09-27
           dev3            10   107       16        2020-08-09       2020-10-18
repo2      dev1            25   712     1184        2020-06-28       2020-12-13
           dev3            51  1141      236        2021-01-10       2021-05-09
           dev4             4     9        7        2021-01-10       2021-05-09
           dev5            10   121      242        2020-08-09       2021-02-07
           dev6            75  4319     8881        2020-07-19       2021-04-25
repo3      dev1            21  3008     6391        2019-09-08       2021-05-09
           dev3             5    51       70        2019-09-08       2021-04-25
           dev6             1    14        1        2020-06-28       2020-06-28

Upvotes: 0

Utsav
Utsav

Reputation: 5918

stats_df.groupby(["repository", "developer"]).agg({
    'commits':'sum',
    'adds':'sum',
    'deletes':'sum',
    'first_commit_week':'first',
    'last_commit_week':'last'
})

Output

                     commits. adds  deletes first_commit_week   last_commit_week
repository  developer                   
repo1   dev1          1     57      12      2021-01-10      2021-01-10
        dev2          5     71      58      2020-08-23      2020-09-27
        dev3          10    107     16      2020-08-09  2020-10-18
repo2   dev1          25    712     1184    2020-06-28  2020-12-13
        dev3          51    1141    236     2021-01-10  2021-05-09
        dev4          4     9       7       2021-01-10  2021-05-09
        dev5          10    121     242     2020-08-09  2021-02-07
        dev6          75    4319    8881    2020-07-19  2021-04-25
repo3   dev1          21    3008    6391    2019-09-08  2021-05-09
        dev3           5    51      70      2019-09-08  2021-04-25
        dev6           1    14      1       2020-06-28  2020-06-28

Upvotes: 2

Related Questions