Reputation: 47
I want to plot line graph of yearly usage of {a, b} pairs.
X axis should be year, and y axis should be usage.
Here is my dataset.
a b year usage
a2 10104 2011 7.292787
a0 10104 2012 43.545533
b0 12011 2009 96.130359
b1 12011 2009 7.658487
b1 14102 2010 6.975572
Upvotes: 2
Views: 177
Reputation: 862591
Use if no duplicates in pairs a, b
vs year
:
s = df.set_index(['year','a','b'])['usage']
.unstack([1,2])
.sort_index(axis=1)
.interpolate('index')
s.index = s.index.astype(str)
s.plot()
Upvotes: 1
Reputation: 294258
Use pivot_table
to rearrange data such that I have combinations of a
and b
in the columns and year
in the index.
However, there may be missing data once we've pivoted. By using interpolate
with the index
parameter, we fill in missing data with the interpolated values while maintaining proportionality with the relative changes in the index value. This matters when the index values are not uniformly separated, eg [2005, 2006, 2008]
. If the index is uniformly separated, then using index
will not harm as it would be the same as the default.
d1 = df.pivot_table('usage', 'year', ['a', 'b']).interpolate('index')
d1.index = pd.PeriodIndex(d1.index, freq='A')
d1.plot()
Response to @jezrael's comments
It is beneficial to observe what the data looks like without interpolate
d1 = df.pivot_table('usage', 'year', ['a', 'b'])
d1.index = pd.PeriodIndex(d1.index, freq='A')
d1
a a0 a2 b0 b1
b 10104 10104 12011 12011 14102
year
2009 NaN NaN 96.130359 7.658487 NaN
2010 NaN NaN NaN NaN 6.975572
2011 NaN 7.292787 NaN NaN NaN
2012 43.545533 NaN NaN NaN NaN
In this case, we don't see what could happen. But I imagined that there was much more data where we'd see gaps of NaN
between sparse data. Instead of forward filling, we can fill proportionately.
d1 = df.pivot_table('usage', 'year', ['a', 'b']).interpolate('index')
d1.index = pd.PeriodIndex(d1.index, freq='A')
d1
a a0 a2 b0 b1
b 10104 10104 12011 12011 14102
year
2009 NaN NaN 96.130359 7.658487 NaN
2010 NaN NaN 96.130359 7.658487 6.975572
2011 NaN 7.292787 96.130359 7.658487 6.975572
2012 43.545533 7.292787 96.130359 7.658487 6.975572
Now again, this would show its benefit only with more data. As with this data, we only get to observe limited forward filling.
Regarding mean
Note that I use pivot_table
and one of the things pivot_table
handles is if there are duplicates within the index/column
specification. It will aggregate all values that fall into that cell with the aggfunc
. By default that function is mean
. However, in this case, per the data that has been presented, there are no duplicate a/b/year
combinations. As such, no aggregation took place.
Upvotes: 3