How can I keep all the data fields when using groupby in a pandas dataframe?

Question

I'm gathering some statistics from about contributions to my organization's repositories. The final dataframe looks like this:

repository          developer  commits   adds  deletes first_commit_week last_commit_week
     repo1               dev1        1     57       12        2021-01-10       2021-01-10
     repo1               dev2        5     71       58        2020-08-23       2020-09-27
     repo1               dev3       10    107       16        2020-08-09       2020-10-18
     repo2               dev1       25    712     1184        2020-06-28       2020-12-13
     repo2               dev3       51   1141      236        2021-01-10       2021-05-09
     repo2               dev4        4      9        7        2021-01-10       2021-05-09
     repo2               dev5       10    121      242        2020-08-09       2021-02-07
     repo2               dev6       75   4319     8881        2020-07-19       2021-04-25
     repo3               dev1       21   3008     6391        2019-09-08       2021-05-09
     repo3               dev3        5     51       70        2019-09-08       2021-04-25
     repo3               dev6        1     14        1        2020-06-28       2020-06-28

After that, I'm grouping the dataframe and printing it like so:

df_groupby_repository_developer = stats_df.groupby(["repository", "developer"])
print(df_groupby_repository_developer.sum().to_string())

Which results in this output

                               commits    adds  deletes
repository developer                                   
repo1           dev1                 1      57       12
                dev2                 5      71       58
                dev3                10     107       16
repo2           dev1                25     712     1184
                dev3                51    1141      236
                dev4                 4       9        7
                dev5                10     121      242
                dev6                75    4319     8881
repo3           dev1                21    3008     6391 
                dev3                 5      51       70
                dev6                 1      14        1

With this I have a nice visual output. It shows the groups very nicely without the extra repository names for everything. However, it loses the two date fields.

I realize this is due to sum(), but I'd like the nice visual output that I have with the groupby but also have the first and last commit dates that are in the original dataframe. How can I accomplish this?

Is there a better way to accomplish this? groupby feels like the wrong way to go about this, now that I've played with it to get this far.

Utsav · Accepted Answer

stats_df.groupby(["repository", "developer"]).agg({
    'commits':'sum',
    'adds':'sum',
    'deletes':'sum',
    'first_commit_week':'first',
    'last_commit_week':'last'
})

Output

                     commits. adds  deletes first_commit_week   last_commit_week
repository  developer                   
repo1   dev1          1     57      12      2021-01-10      2021-01-10
        dev2          5     71      58      2020-08-23      2020-09-27
        dev3          10    107     16      2020-08-09  2020-10-18
repo2   dev1          25    712     1184    2020-06-28  2020-12-13
        dev3          51    1141    236     2021-01-10  2021-05-09
        dev4          4     9       7       2021-01-10  2021-05-09
        dev5          10    121     242     2020-08-09  2021-02-07
        dev6          75    4319    8881    2020-07-19  2021-04-25
repo3   dev1          21    3008    6391    2019-09-08  2021-05-09
        dev3           5    51      70      2019-09-08  2021-04-25
        dev6           1    14      1       2020-06-28  2020-06-28

How can I keep all the data fields when using groupby in a pandas dataframe?

Answers (2)

Related Questions