Reputation: 820
I have a list of voyages with a start and end date and the earnings for that voyage. I want to calculate the monthly earnings, but I am not sure how I can do that using Pandas:
'2016-02-28 07:30:00', '2016-04-30 00:00:00', '600000'
'2016-05-18 10:30:00', '2016-07-12 02:19:00', '700000'
The way I do this manually is to calculate how many days of the voyage fall in each month and multiply that by earnings / total length of the voyage.
Upvotes: 2
Views: 213
Reputation: 863166
You need to check how many hours fall in each date range, row by row. So use DataFrame.apply with a custom function that builds a date_range, applies groupby by month, and aggregates with size.
print (df)
                start                 end   price
0 2016-02-28 07:30:00 2016-04-30 00:00:00  600000
1 2016-05-18 10:30:00 2016-07-12 02:19:00  700000

print (df.dtypes)
start    datetime64[ns]
end      datetime64[ns]
price             int64
dtype: object

def f(x):
    # count the hourly timestamps of one voyage that fall in each month
    # 'h' is the hourly alias (spelled 'H' before pandas 2.2)
    rng = pd.date_range(x.start, x.end, freq='h')
    return rng.to_series().groupby([rng.month]).size()

df1 = df.apply(f, axis=1)
print (df1)
      2      3      4      5      6      7
0  41.0  744.0  696.0    NaN    NaN    NaN
1   NaN    NaN    NaN  326.0  720.0  266.0
Then get price_per_hour by dividing the price column by the total number of hours in each row:
price_per_hour = df.price / df1.sum(axis=1)
print (price_per_hour)
0 405.131668
1 533.536585
dtype: float64
Finally, multiply the hours in each month by price_per_hour using mul:
print (df1.mul(price_per_hour, axis=0))
2 3 4 5 6 \
0 16610.398379 301417.960837 281971.640783 NaN NaN
1 NaN NaN NaN 173932.926829 384146.341463
7
0 NaN
1 141920.731707
# check: each row sums back to the original price
print (df1.mul(price_per_hour, axis=0).sum(axis=1))
0 600000.0
1 700000.0
dtype: float64
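Counting hourly timestamps rounds the time spent in each month to whole hours. As a cross-check, here is a minimal sketch that uses only the standard library: it splits a voyage at month boundaries and prorates the price by exact elapsed time. The helper name prorate_by_month is made up for illustration and is not part of pandas or of the method above.

```python
from datetime import datetime

def prorate_by_month(start, end, price):
    """Split price across (year, month) keys in proportion to elapsed time.

    Hypothetical helper: assumes naive datetimes with start <= end.
    """
    total_seconds = (end - start).total_seconds()
    shares = {}
    cursor = start
    while cursor < end:
        # first instant of the month that follows `cursor`
        if cursor.month == 12:
            next_month = datetime(cursor.year + 1, 1, 1)
        else:
            next_month = datetime(cursor.year, cursor.month + 1, 1)
        # the current segment stops at the month boundary or the voyage end
        segment_end = min(next_month, end)
        seconds = (segment_end - cursor).total_seconds()
        shares[(cursor.year, cursor.month)] = price * seconds / total_seconds
        cursor = segment_end
    return shares

# first voyage from the question
voyage = prorate_by_month(datetime(2016, 2, 28, 7, 30),
                          datetime(2016, 4, 30, 0, 0), 600000)
for (year, month), amount in voyage.items():
    print(year, month, round(amount, 2))
```

Because the keys are (year, month) pairs, a voyage that runs longer than a year keeps the same calendar month of different years apart.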
You can also count prices per day by changing the hourly freq to freq='D', but I think it is less accurate:
def f(x):
    rng = pd.date_range(x.start, x.end, freq='D')
    return rng.to_series().groupby([rng.month]).size()

df1 = df.apply(f, axis=1)
print (df1)
     2     3     4     5     6     7
0  2.0  31.0  29.0   NaN   NaN   NaN
1  NaN   NaN   NaN  14.0  30.0  11.0

# with freq='D' the counts are days, so this is a price per day
price_per_day = df.price / df1.sum(axis=1)
print (price_per_day)
0     9677.419355
1    12727.272727
dtype: float64

print (df1.mul(price_per_day, axis=0))
             2         3             4              5              6         7
0  19354.83871  300000.0  280645.16129            NaN            NaN       NaN
1          NaN       NaN           NaN  178181.818182  381818.181818  140000.0

print (df1.mul(price_per_day, axis=0).sum(axis=1))
0    600000.0
1    700000.0
dtype: float64
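Grouping by rng.month alone would merge the same calendar month of different years if a voyage lasted longer than a year, and the question asks for earnings per month rather than per voyage. A possible sketch of the daily variant that labels columns as 'YYYY-MM' and then sums over all voyages follows; the names days_per_year_month, counts, per_voyage and monthly_total are chosen here for illustration, and the sample frame is rebuilt from the question's data.

```python
import pandas as pd

# sample data copied from the question
df = pd.DataFrame({
    'start': pd.to_datetime(['2016-02-28 07:30:00', '2016-05-18 10:30:00']),
    'end': pd.to_datetime(['2016-04-30 00:00:00', '2016-07-12 02:19:00']),
    'price': [600000, 700000],
})

def days_per_year_month(row):
    # daily timestamps of one voyage, counted per 'YYYY-MM' label
    rng = pd.date_range(row['start'], row['end'], freq='D')
    labels = rng.strftime('%Y-%m').to_numpy()
    return rng.to_series().groupby(labels).size()

# one row per voyage, one column per year-month (NaN where no overlap)
counts = df.apply(days_per_year_month, axis=1)
price_per_day = df['price'] / counts.sum(axis=1)
per_voyage = counts.mul(price_per_day, axis=0)

# add up all voyages to get the earnings of each month
monthly_total = per_voyage.sum()
print(monthly_total)
```

The same change should carry over to the hourly variant by swapping the freq argument.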
Another solution reshapes the data with melt, then uses groupby with resample. It also needs a groupby by month, aggregated with size:
df['count'] = df.index
df1 = pd.melt(df, id_vars=['price', 'count'], value_name='dates')
print (df1)
price count variable dates
0 600000 0 start 2016-02-28 07:30:00
1 700000 1 start 2016-05-18 10:30:00
2 600000 0 end 2016-04-30 00:00:00
3 700000 1 end 2016-07-12 02:19:00
df2 = df1.set_index('dates').groupby('count').resample('D').size()
print (df2)
count dates
0 2016-02-28 1
2016-02-29 0
2016-03-01 0
2016-03-02 0
2016-03-03 0
2016-03-04 0
2016-03-05 0
2016-03-06 0
2016-03-07 0
2016-03-08 0
2016-03-09 0
2016-03-10 0
2016-03-11 0
2016-03-12 0
...
...
print (df2.index.get_level_values('dates').month)
[2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5
5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7
7 7 7 7 7 7 7 7]
df3 = df2.groupby([df2.index.get_level_values('count'),
                   df2.index.get_level_values('dates').month]).size().unstack()
print (df3)
2 3 4 5 6 7
count
0 2.0 31.0 30.0 NaN NaN NaN
1 NaN NaN NaN 14.0 30.0 12.0
# df3 holds day counts, so this is a price per day
price_per_day = df.price / df3.sum(axis=1)
print (price_per_day)
0     9523.809524
1    12500.000000
dtype: float64

print (df3.mul(price_per_day, axis=0))
                  2              3              4         5         6         7
count
0      19047.619048  295238.095238  285714.285714       NaN       NaN       NaN
1               NaN            NaN            NaN  175000.0  375000.0  150000.0

print (df3.mul(price_per_day, axis=0).sum(axis=1))
count
0    600000.0
1    700000.0
dtype: float64
Upvotes: 2