Reputation: 4921
For a dataset DF, I would like to add temperatures taken from a secondary dataset TEMP.
TEMP does not have all dates. Missing dates need to be interpolated such that the latest available value is carried forward: for example, if the temperature is missing for 2019-02-20 but available for 2019-02-19, that value is filled in. This can be done with pd.DataFrame(x.asfreq('D')).ffill().reset_index(drop=False).
When there are multiple temperature measures per day, a weighted average should be applied.
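For example, place C has two measurements on 2019-02-14 (temp 10 with quantity 1, temp 11 with quantity 3), so the weighted average is (10*1 + 11*3) / (1 + 3) = 10.75. With NumPy:
np.average([10, 11], weights=[1, 3])  # 10.75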
In the current code, dates are pre-selected using isin to limit the data size.
The code works, but it is not optimal when the data become big, particularly if temperatures from thousands of places and dates need to be filled in and merged to DF. I am searching for a better solution in terms of timing / memory, for example one based on itertools, apply, a generator expression, or anything else.
Below I show a small, fictitious, reproducible example with code.
Modules:
import numpy as np
import pandas as pd
Fictitious data:
DF = pd.DataFrame({'date': ['2019-01-01', '2019-01-11', '2019-01-13', '2019-01-14', '2019-01-22', '2019-02-14'], 'place':['A', 'A','A','A','B','C']})
TEMP = pd.DataFrame({'date':['2019-01-10', '2019-01-14', '2019-01-20', '2019-01-20', '2019-01-22', '2019-01-23', '2019-01-24', '2019-02-14', '2019-02-14'], 'place':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], 'temp': [5, 7, 9, 10, 10, 2, 0, 10, 11], 'quantity': [3,5,2,1,1,2,2,1,3]})
DF['date'] = pd.to_datetime(DF['date'])
TEMP['date'] = pd.to_datetime(TEMP['date'])
The code:
if 'temp' in DF.columns:
    del DF['temp']
else:
    print('No variable temp found in DF')
def filltemps(dat1, dat2):
    """dat1: TEMP
    dat2: DF"""
    global totmp
    places = dat2['place'].unique()
    mx = len(places)
    totmp = pd.DataFrame(columns=['date', 'temp', 'place'])
    for i in range(mx):
        # temperature rows for this place only
        nsp = pd.DataFrame(dat1[dat1['place'] == places[i]])
        nsp = nsp[['date', 'quantity', 'temp']]
        # weighted average of temp per date, weighted by quantity
        prod = lambda w, z: w * z
        nsp['sumn'] = prod(nsp['temp'], nsp['quantity'])
        wavg = lambda y, x: y / x
        c3 = wavg(nsp.groupby('date')['sumn'].agg('sum'),
                  nsp.groupby('date')['quantity'].agg('sum'))
        # fill missing days with the latest available value
        mp = pd.DataFrame(c3.asfreq('D')).ffill().reset_index(drop=False)
        mp.columns = ['date', 'temp']
        mp['place'] = np.array([places[i]] * len(mp))
        mp['date'] = pd.to_datetime(mp['date'])
        # keep only the dates needed for this place
        dd1 = dat2.loc[dat2['place'] == places[i], ['date']]
        mp = mp[mp['date'].isin(list(pd.to_datetime(dd1['date'])))]
        totmp = pd.concat([totmp, mp])
    return totmp
Timing with %timeit TEMP2 = filltemps(TEMP, DF) shows 116 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 10 loops each).
TEMP2 = filltemps(TEMP, DF)
In TEMP2 it can be seen that, as expected, a weighted average (10.75 for 2019-02-14) was computed for place C.
TEMP2 will be merged with the original DF. Note that there can be missing values if no earlier data were available.
DF = pd.merge(DF, TEMP2, how='left', on=['date', 'place'])
DF, the expected outcome, should look like this:
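        date place   temp
0 2019-01-01     A    NaN
1 2019-01-11     A   5.00
2 2019-01-13     A   5.00
3 2019-01-14     A   7.00
4 2019-01-22     B  10.00
5 2019-02-14     C  10.75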
Your pointers and help are very much appreciated!
Upvotes: 1
Views: 237
Reputation: 1286
My solution is similar to the one of Vaishali, but I want to point out one pitfall of asfreq.
Let's start at the beginning. We calculate the weighted averages:
TEMP = TEMP.groupby(['date', 'place']).apply(lambda x: np.round(np.average(x['temp'], weights=x['quantity']), 2)).reset_index(name='temp')
Now we calculate the full date range:
rng = pd.date_range(DF.date.min(), DF.date.max(), freq='D')
We use this data range to reindex the temperatures:
TEMP = TEMP.groupby('place')[['date', 'temp']].apply(lambda x: x.set_index('date').reindex(rng).ffill()).reset_index().rename(columns={'level_1': 'date'})
In contrast to asfreq, this also handles the case where the temperature time series is "shorter" than the location series.
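A tiny illustration of the pitfall (a toy series using the values of place C): asfreq('D') can only span the series' own first to last date, so it cannot forward-fill past the last observation, whereas reindex against an external date range can:
s = pd.Series([10.75], index=pd.to_datetime(['2019-02-14']))
s.asfreq('D', method='ffill')  # index still ends at 2019-02-14
s.reindex(pd.date_range('2019-02-14', '2019-02-15'), method='ffill')  # 2019-02-15 -> 10.75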
Finally we can put everything together:
DF.merge(TEMP, on=['date', 'place'])
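The benchmark below calls this pipeline Joerg(TEMP, DF); wrapped into a single function it could look like this (a sketch assembled from the steps above, not the answerer's original wrapper):
def Joerg(TEMP, DF):
    # weighted average of temp per (date, place), weighted by quantity
    temp = (TEMP.groupby(['date', 'place'])
                .apply(lambda x: np.round(np.average(x['temp'], weights=x['quantity']), 2))
                .reset_index(name='temp'))
    # reindex each place against the full date range of DF, then forward-fill
    rng = pd.date_range(DF.date.min(), DF.date.max(), freq='D')
    temp = (temp.groupby('place')[['date', 'temp']]
                .apply(lambda x: x.set_index('date').reindex(rng).ffill())
                .reset_index()
                .rename(columns={'level_1': 'date'}))
    return DF.merge(temp, on=['date', 'place'])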
If we slightly modify the input by changing the last date:
DF = pd.DataFrame({'date': ['2019-01-01', '2019-01-11', '2019-01-13', '2019-01-14', '2019-01-22', '2019-02-15'], 'place':['A', 'A','A','A','B','C']})
the solution of Ruthger delivers:
         date place  temp
1  2019-01-11     A   5.0
3  2019-01-13     A   5.0
4  2019-01-14     A   7.0
0  2019-01-22     B  10.0
Vaishali:
         date place  temp
0  2019-01-01     A   NaN
1  2019-01-11     A   5.0
2  2019-01-13     A   5.0
3  2019-01-14     A   7.0
4  2019-01-22     B  10.0
5  2019-02-15     C   NaN
Joerg:
         date place   temp
0  2019-01-01     A    NaN
1  2019-01-11     A   5.00
2  2019-01-13     A   5.00
3  2019-01-14     A   7.00
4  2019-01-22     B  10.00
5  2019-02-15     C  10.75
Briefly, the timings:
%timeit Ruthger(TEMP, DF)
526 ms ± 36.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit Vaishali(TEMP, DF)
12.1 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit Joerg(TEMP, DF)
14.9 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 0
Reputation: 38415
Calculate the weighted average in TEMP, fill missing dates in TEMP by place, and merge the resulting dataframe with DF.
TEMP.date = pd.to_datetime(TEMP.date)
TEMP = TEMP.groupby(['date', 'place']).apply(lambda x: np.round(np.average(x['temp'],weights=x['quantity']), 2)).reset_index(name = 'temp')
DF.merge(TEMP.set_index('date').groupby('place')['temp'].apply(lambda x: x.asfreq('D', method='ffill')).reset_index(), 'left')
         date place   temp
0  2019-01-01     A    NaN
1  2019-01-11     A   5.00
2  2019-01-13     A   5.00
3  2019-01-14     A   7.00
4  2019-01-22     B  10.00
5  2019-02-14     C  10.75
Upvotes: 1
Reputation: 186
Here's a replacement for filltemps that runs 10x faster on my machine. It also uses apply, as requested :)
def interpolate_temps(temp_data, other_data):
    """Interpolate temperature by place and date on a daily frequency.

    Parameters
    ----------
    temp_data : pd.DataFrame
        with columns:
            date : datetime
            place : categorical
            temp : numerical
            quantity : numerical
    other_data : pd.DataFrame
        with columns:
            date : datetime
            place : categorical

    Returns
    -------
    pd.DataFrame
        With date and place in the index and a temp column
    """
    def weighted_average(df, col1, col2):
        """Average of col1 weighted by col2"""
        return (df[col1] * df[col2]).sum() / df[col2].sum()

    results = []
    for place in temp_data.place.unique():
        # only the dates we actually need for this place
        dates_needed = other_data[other_data.place == place].date.unique()
        result = (temp_data[temp_data.place == place]
                  .groupby('date')
                  .apply(weighted_average, 'temp', 'quantity')
                  .reindex(dates_needed, method='ffill')
                  .to_frame('temp')
                  .assign(place=place)
                  .set_index('place', append=True))
        results.append(result)
    return pd.concat(results)
It follows the same basic structure as yours:
Iterate through each place.
Take the weighted average of temperature by date.
Fill in the missing values with ffill.
Add place back in as a column.
Join the results for each place.
The major advance in time is downselecting to the dates needed for a particular place before running ffill. This should also make it more memory-efficient, as should the fact that we're not making so many named intermediate objects. The biggest place where this was an issue in your code was totmp = pd.concat([totmp, mp]), where you created len(places) different versions of totmp, whereas in interpolate_temps we wait until the end to pd.concat(results) all at once.
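A minimal sketch of the difference (toy frames, purely illustrative):
frames = [pd.DataFrame({'x': [i]}) for i in range(1000)]

# quadratic: each iteration copies everything accumulated so far
acc = pd.DataFrame(columns=['x'])
for f in frames:
    acc = pd.concat([acc, f])

# linear: concatenate once at the end
out = pd.concat(frames)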
The result of interpolate_temps has date and place in the index, so you can join it like this:
DF.join(interpolate_temps(TEMP, DF), on=['date', 'place'])
or reset the index and merge like you had been doing.
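The merge variant would look like this (a sketch mirroring the question's own merge):
TEMP2 = interpolate_temps(TEMP, DF).reset_index()
DF = pd.merge(DF, TEMP2, how='left', on=['date', 'place'])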
The only other advice I have is to use more verbose variable names. I never really figured out what many of them meant, and that made it harder to understand your code.
Upvotes: 1