Reputation: 3413
I'm gathering some statistics from about contributions to my organization's repositories. The final dataframe looks like this:
repository developer commits adds deletes first_commit_week last_commit_week
repo1 dev1 1 57 12 2021-01-10 2021-01-10
repo1 dev2 5 71 58 2020-08-23 2020-09-27
repo1 dev3 10 107 16 2020-08-09 2020-10-18
repo2 dev1 25 712 1184 2020-06-28 2020-12-13
repo2 dev3 51 1141 236 2021-01-10 2021-05-09
repo2 dev4 4 9 7 2021-01-10 2021-05-09
repo2 dev5 10 121 242 2020-08-09 2021-02-07
repo2 dev6 75 4319 8881 2020-07-19 2021-04-25
repo3 dev1 21 3008 6391 2019-09-08 2021-05-09
repo3 dev3 5 51 70 2019-09-08 2021-04-25
repo3 dev6 1 14 1 2020-06-28 2020-06-28
After that, I'm grouping the dataframe and printing it like so:
df_groupby_repository_developer = stats_df.groupby(["repository", "developer"])
print(df_groupby_repository_developer.sum().to_string())
Which results in this output
commits adds deletes
repository developer
repo1 dev1 1 57 12
dev2 5 71 58
dev3 10 107 16
repo2 dev1 25 712 1184
dev3 51 1141 236
dev4 4 9 7
dev5 10 121 242
dev6 75 4319 8881
repo3 dev1 21 3008 6391
dev3 5 51 70
dev6 1 14 1
With this I have a nice visual output. It shows the groups very nicely without the extra repository names for everything. However, it loses the two date fields.
I realize this is due to sum()
, but I'd like the nice visual output that I have with the groupby
but also have the first and last commit dates that are in the original dataframe. How can I accomplish this?
Is there a better way to accomplish this? groupby
feels like the wrong way to go about this, now that I've played with it to get this far.
Upvotes: 0
Views: 520
Reputation: 2293
Your data never has more than one (repo, dev) couple, so I'm wondering whether you really want to do a sum
, or if you're just using it because groupby
requires an aggregation function. If you don't really need to sum
, then you can use the identity function as the aggregation function, this works for all your columns, including dates:
In [14]: stats_df.groupby(['repository', 'developer']).agg(lambda x: x)
Out[14]:
commits adds deletes first_commit_week last_commit_week
repository developer
repo1 dev1 1 57 12 2021-01-10 2021-01-10
dev2 5 71 58 2020-08-23 2020-09-27
dev3 10 107 16 2020-08-09 2020-10-18
repo2 dev1 25 712 1184 2020-06-28 2020-12-13
dev3 51 1141 236 2021-01-10 2021-05-09
dev4 4 9 7 2021-01-10 2021-05-09
dev5 10 121 242 2020-08-09 2021-02-07
dev6 75 4319 8881 2020-07-19 2021-04-25
repo3 dev1 21 3008 6391 2019-09-08 2021-05-09
dev3 5 51 70 2019-09-08 2021-04-25
dev6 1 14 1 2020-06-28 2020-06-28
Upvotes: 0
Reputation: 5918
stats_df.groupby(["repository", "developer"]).agg({
'commits':'sum',
'adds':'sum',
'deletes':'sum',
'first_commit_week':'first',
'last_commit_week':'last'
})
Output
commits. adds deletes first_commit_week last_commit_week
repository developer
repo1 dev1 1 57 12 2021-01-10 2021-01-10
dev2 5 71 58 2020-08-23 2020-09-27
dev3 10 107 16 2020-08-09 2020-10-18
repo2 dev1 25 712 1184 2020-06-28 2020-12-13
dev3 51 1141 236 2021-01-10 2021-05-09
dev4 4 9 7 2021-01-10 2021-05-09
dev5 10 121 242 2020-08-09 2021-02-07
dev6 75 4319 8881 2020-07-19 2021-04-25
repo3 dev1 21 3008 6391 2019-09-08 2021-05-09
dev3 5 51 70 2019-09-08 2021-04-25
dev6 1 14 1 2020-06-28 2020-06-28
Upvotes: 2