Reputation: 990
Terrible title, but here goes. I have a 13,000 x 91 dataframe; 26 of the columns are numeric. The rows are individual projects, with project performance split by year. Like this:
| Year | Control | Description | USD_Cost | USD_Profit |
|------|---------|---------------|----------|------------|
| 1991 | A1 | A description | 1 | 2 |
| 1992 | A1 | A Description | 100 | 300 |
| 1991 | B1 | B Description | 3 | 50 |
| 1995 | C1 | C Description | 5 | 10 |
| 1990 | D1 | D Description | 2 | 1 |
| 1996 | D1 | D Description | 1 | 1 |
Rather than recording how each project performed in each particular year, I just want to record how long each project lasted, and the overall project performance:
| Years | Control | Description | USD_Cost | USD_Profit |
|-------|---------|---------------|----------|-----------|
| 2 | A1 | A description | 101 | 302 |
| 1 | B1 | B Description | 3 | 50 |
| 1 | C1 | C Description | 5 | 10 |
| 2 | D1 | D Description | 3 | 2 |
`Control` and `Description` aren't changed, but the numeric columns beginning with `USD` are summed across rows. And there are 26 USD columns for different performance aspects. There are about 8,000 unique control numbers but 13,000 total Year-Control combinations.
I know how to group by a single key, e.g. `print(dft.groupby('Control')[['USD_Cost', 'USD_Profit']].sum())`, but when I do that I think I'm losing all the non-numeric columns. I'd also like to avoid typing out the names of all 26 USD columns.
Can this be done with groupby?
Upvotes: 0
Views: 769
Reputation: 13666
This is a really common kind of operation, and pandas has an elegant way of doing it. To avoid the tedious task of writing out 26 summation entries by hand, I build the USD entries with a dictionary comprehension. First define a dictionary of actions by column, then pass it to the `agg` function:
import pandas as pd

df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": [
        "A description",
        "A description",
        "B description",
        "C description",
        "D description",
        "D description",
    ],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

actions = {'Year': pd.Series.nunique,
           'Description': lambda x: x.iloc[0]}
actions.update({x: sum for x in df.columns if x.startswith('USD_')})
df.groupby('Control').agg(actions).reset_index()
which gives:
Control Year Description USD_Cost USD_Profit
0 A1 2 A description 101 302
1 B1 1 B description 3 50
2 C1 1 C description 5 10
3 D1 2 D description 3 2
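As a side note, the same dictionary can be built with string aggregation names (`'nunique'`, `'first'`, `'sum'`), which pandas resolves to its built-in implementations, so no lambdas are needed. A sketch on the same sample frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": ["A description", "A description", "B description",
                    "C description", "D description", "D description"],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

# String names map to pandas' optimized aggregations
actions = {"Year": "nunique", "Description": "first"}
actions.update({c: "sum" for c in df.columns if c.startswith("USD_")})
out = df.groupby("Control").agg(actions).reset_index()
```

This produces the same table as above and tends to be faster on large frames, since `'sum'` and `'nunique'` avoid calling a Python function per group.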
Upvotes: 1
Reputation: 192
My solution is to group by `Control` and then apply a function to each group that takes the non-numeric data from the first row (I'm assuming non-numeric data is identical across a project's rows) and sums the numeric data. `Year` is treated separately, since you want the number of years, not their sum.
My code:
import pandas as pd
import numpy as np

def sum_project(project):
    # Non-numeric data is identical within a project, so start from a
    # copy of the first row (the copy avoids a SettingWithCopyWarning)
    project_summed = project.iloc[0, :].copy()
    # Sum all numeric columns, excluding "Year"
    cols_numeric = project.select_dtypes([np.number]).columns
    cols_numeric = cols_numeric.drop(["Year"])
    project_summed[cols_numeric] = project[cols_numeric].sum()
    # Record the project duration instead of summing the years
    project_summed["Year"] = len(project)
    return project_summed
df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": [
        "A description",
        "A description",
        "B description",
        "C description",
        "D description",
        "D description",
    ],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})
final_df = df.groupby(["Control"]).apply(sum_project)
which gives final_df:
Year Control Description USD_Cost USD_Profit
Control
A1 2 A1 A description 101 302
B1 1 B1 B description 3 50
C1 1 C1 C description 5 10
D1 2 D1 D description 3 2
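Note that the result carries `Control` twice, once as the group index and once as a column. One way to tidy that, a sketch reusing the same function (the `errors="ignore"` makes it tolerate pandas versions that exclude the grouping column from `apply`):

```python
import pandas as pd
import numpy as np

def sum_project(project):
    # First row carries the non-numeric data; copy to avoid a warning
    project_summed = project.iloc[0, :].copy()
    cols_numeric = project.select_dtypes([np.number]).columns
    cols_numeric = cols_numeric.drop(["Year"])
    project_summed[cols_numeric] = project[cols_numeric].sum()
    project_summed["Year"] = len(project)
    return project_summed

df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": ["A description", "A description", "B description",
                    "C description", "D description", "D description"],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

# Drop the duplicated Control column (if present) and recover it
# from the group index instead
final_df = (df.groupby("Control")
              .apply(sum_project)
              .drop(columns="Control", errors="ignore")
              .reset_index())
```

This leaves `Control` as an ordinary column and a plain 0..n-1 index.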
Upvotes: 1
Reputation: 61
I think this should work for you
columns = list(filter(lambda x: 'USD' in x, df.columns))
df.groupby(['Control', 'Description'])[columns].sum()
That sums all the USD columns, grouped by `Control` and `Description`; putting `Description` in the grouping keys keeps it in the result without having to list it in an aggregation. That shouldn't be a problem for what you're up to, so I think this is the simplest way.
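One caveat: the sum alone drops the project duration the question also asks for. If that's needed, a `nunique` over `Year` from the same groupby can be attached. A sketch, assuming the sample frame from the other answers:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1991, 1992, 1991, 1995, 1990, 1996],
    "Control": ["A1", "A1", "B1", "C1", "D1", "D1"],
    "Description": ["A description", "A description", "B description",
                    "C description", "D description", "D description"],
    "USD_Cost": [1, 100, 3, 5, 2, 1],
    "USD_Profit": [2, 300, 50, 10, 1, 1],
})

columns = list(filter(lambda x: "USD" in x, df.columns))
grouped = df.groupby(["Control", "Description"])
result = grouped[columns].sum()
# Count distinct years per project; aligns on the same group index
result["Years"] = grouped["Year"].nunique()
result = result.reset_index()
```

Because both aggregations come from the same `groupby` object, the `Years` column aligns on the (`Control`, `Description`) index without an explicit merge.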
Upvotes: 1