Reputation: 1637
I'm a newbie and I am sure there's an easy way to do this that I don't know about, thank you in advance for your help.
I've got the historical min and max sales numbers for the highest and lowest sales team performer on each day of the year for the last 10 years. I've also got the same thing for 2016 (example data below).
hist_min hist_max 2016_min 2016_max
Day_of_Year
1 1000 10000 898 9000
2 234 896 300 1000
3 1254 23666 1000 24000
4 930 78999 1000 1050
5 278 74588 300 5000
The end goal is to chart this data in matplotlib but I only want to get the vales from the 2016_min column that is below the value in the hist_min, and similarly only have values for the 2016_max columns that are more than the hist_max column. The data would look like this:
hist_min hist_max 2016_min 2016_max
Day_of_Year
1 1000 10000 898 NULL
2 234 896 NULL 1000
3 1254 23666 1000 24000
4 930 78999 NULL NULL
5 278 74588 NULL NULL
I put 'NULL' in there to represent the empty values, Nan might be better but I don't know if matplot lib can handle 'Nan' numbers or not... that's the next step so I'll find out soon enough.
Thank you in advance for your help, Me
Upvotes: 2
Views: 5575
Reputation: 339230
You can index a dataframe by a condition,
df1 = df[df["2016_max"] > df["hist_max"]]
This can then easily plotted with matplotlib.
u = u"""Day_of_Year hist_min hist_max 2016_min 2016_max
1 1000 10000 898 9000
2 234 896 300 1000
3 1254 23666 1000 24000
4 930 78999 1000 1050
5 278 74588 300 5000"""
import io
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(io.StringIO(u), index_col=0, delim_whitespace=True)
df1 = df[df["2016_max"] > df["hist_max"]]
df2 = df[df["2016_min"] < df["hist_min"]]
fig, ax = plt.subplots()
ax.scatter(df1.index, df1["2016_max"], label="max. 2016")
ax.scatter(df2.index, df1["2016_min"], label="min. 2016")
plt.legend()
plt.show()
Upvotes: 1
Reputation: 862641
Use mask
with boolean masks which return NaN
,NULL
or None
if True
s:
print (df['2016_min'] > df['hist_min'])
Day_of_Year
1 False
2 True
3 False
4 True
5 True
dtype: bool
df['2016_min'] = df['2016_min'].mask(df['2016_min'] > df['hist_min'])
df['2016_max'] = df['2016_max'].mask(df['2016_max'] < df['hist_max'])
print (df)
hist_min hist_max 2016_min 2016_max
Day_of_Year
1 1000 10000 898.0 NaN
2 234 896 NaN 1000.0
3 1254 23666 1000.0 24000.0
4 930 78999 NaN NaN
5 278 74588 NaN NaN
df['2016_min'] = df['2016_min'].mask(df['2016_min'] > df['hist_min'], 'NULL')
df['2016_max'] = df['2016_max'].mask(df['2016_max'] < df['hist_max'], 'NULL')
print (df)
hist_min hist_max 2016_min 2016_max
Day_of_Year
1 1000 10000 898 NULL
2 234 896 NULL 1000
3 1254 23666 1000 24000
4 930 78999 NULL NULL
5 278 74588 NULL NULL
df['2016_min'] = df['2016_min'].mask(df['2016_min'] > df['hist_min'], None)
df['2016_max'] = df['2016_max'].mask(df['2016_max'] < df['hist_max'], None)
print (df)
hist_min hist_max 2016_min 2016_max
Day_of_Year
1 1000 10000 898 None
2 234 896 None 1000
3 1254 23666 1000 24000
4 930 78999 None None
5 278 74588 None None
Upvotes: 1