Reputation: 829
I want to find the states with the highest average total revenue and be able to see states with the 40-45th highest average, 35-40th, etc for all states from 1992-2016.
Data is organized in a dataframe in the below picture. So ideally I could have another column like the following. I think this is what I am trying to do.
STATE // YEAR // TOTAL_REVENUE // AVG_TOTAL_REVENUE
ALABAMA // 1992 // 5000 // 6059
ALABAMA // 1993 // 4000 // 6059
ALASKA // 1992 // 3000 // 2059
ALABAMA // 1996 // 6019 // 6059
Is this possible to do? I am not sure if I am stating what I want to do correctly and not sure what I am looking for google wise to figure out a way forward.
Upvotes: 1
Views: 798
Reputation: 2759
Assuming your input looks like:
STATE YEAR TOTAL_REVENUE
Michigan 2001 1000
Michigan 2002 2000
California 2003 3000
California 2004 4000
Michigan 2005 5000
Then just do:
df['AVG_TOTAL_REVENUE'] = np.nan
states = df['STATE'].tolist()
states = list(set(states))
for state in states:
state_values = df[df['STATE'] == state]
revenues = state_values['TOTAL_REVENUE'].tolist()
revenues = [float(x) for x in revenues]
avg = sum(revenues)/len(revenues)
df['AVG_TOTAL_REVENUE'].loc[state_values.index] = avg
which gives you:
STATE YEAR TOTAL_REVENUE AVG_TOTAL_REVENUE
0 Michigan 2001 1000 2666.666667
1 Michigan 2002 2000 2666.666667
2 California 2003 3000 3500.000000
3 California 2004 4000 3500.000000
4 Michigan 2005 5000 2666.666667
Upvotes: 6
Reputation: 1514
If your data is stored in a pandas dataframe called df with STATE as index, then you can try:
df.set_index("STATE",inplace=True)
avg_revenue = df.groupby(level=0)["TOTAL_REVENUE"].agg("mean")
df["AVG_TOTAL_REVENUE"] = avg_revenue.loc[df.index]
df = df.sort_values(by="AVG_TOTAL_REVENUE",ascending=False)
Regarding the "40-45th highest average", I'm not sure exactly what you're looking for. But you could do this for instance:
import numpy as np
bin = (np.array([0.40, 0.45]) * len(df)).astype(int)
df.iloc[bin[0]:bin[1],:]
# Or with quantiles
min_q,max_q = (0.40, 0.45)
avg = df.AVG_TOTAL_REVENUE
df.loc[(avg >= avg.quantile(min_q)) & (avg <= avg.quantile(max_q)), :]
Or maybe you want to bin your data every 5 state in order of AVG_TOTAL_REVENUE?
df_grouped = df.groupby("STATE")["AVG_TOTAL_REVENUE"].agg("first")
n_bins = int(df_grouped.shape[0] / 5)
bins = (pd.cut(df_grouped,bins=n_bins)
.reset_index()
.groupby("AVG_TOTAL_REVENUE")
.agg(list)
)
Upvotes: 1