Ruthger Righart

Reputation: 4921

Filling in missing data and merging with the main set

For a dataset DF, I would like to add temperatures taken from a secondary dataset TEMP. TEMP does not cover all dates. Missing dates should be interpolated in such a way that the latest available value is carried forward: for example, if the temperature is missing for 2019-02-20 but was available for 2019-02-19, that value is filled in. This can be done with pd.DataFrame(x.asfreq('D')).ffill().reset_index(drop=False).

When there are multiple temperature measurements per day, a weighted average should be applied. In the current code, dates are pre-selected using isin to limit the data size.
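To illustrate the weighted-average step on its own (a toy sketch on two measurements for one date, not the full code):

```python
import numpy as np
import pandas as pd

# Toy data: two measurements on the same day are combined into one
# quantity-weighted average per date.
t = pd.DataFrame({'date': ['2019-01-20', '2019-01-20'],
                  'temp': [9, 10],
                  'quantity': [2, 1]})
wavg = t.groupby('date').apply(
    lambda x: np.average(x['temp'], weights=x['quantity']))
print(wavg)  # (9*2 + 10*1) / 3 = 9.333...
```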

The code works, but is not optimal as the data grow, particularly if temperature data from thousands of places and dates need to be filled in and merged into DF. I am looking for a solution that is better in terms of timing and memory, e.g. based on itertools, apply, a generator expression, or anything else.

Below I show a fictive small reproducible example with code.

Modules:

import numpy as np 
import pandas as pd

Fictive data:

DF = pd.DataFrame({'date': ['2019-01-01', '2019-01-11', '2019-01-13', '2019-01-14', '2019-01-22', '2019-02-14'], 'place':['A', 'A','A','A','B','C']})
TEMP = pd.DataFrame({'date':['2019-01-10', '2019-01-14', '2019-01-20', '2019-01-20', '2019-01-22', '2019-01-23', '2019-01-24', '2019-02-14', '2019-02-14'], 'place':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'], 'temp': [5, 7, 9, 10, 10, 2, 0, 10, 11], 'quantity': [3,5,2,1,1,2,2,1,3]})
DF['date'] = pd.to_datetime(DF['date'])
TEMP['date'] = pd.to_datetime(TEMP['date'])

The code:

if 'temp' in DF.columns:
    del DF['temp']
else:
    print('No variable temp found in DF')


def filltemps(dat1, dat2):
    """dat1: TEMP
    dat2: DF"""

    global totmp
    places = dat2['place'].unique()
    mx = len(places)
    totmp = pd.DataFrame(columns=['date', 'temp', 'place'])

    for i in range(mx):
        # temperature rows for the current place
        nsp = pd.DataFrame(dat1[dat1['place'] == places[i]])
        nsp = nsp[['date', 'quantity', 'temp']]
        # quantity-weighted average temperature per date
        nsp['sumn'] = nsp['temp'] * nsp['quantity']
        c3 = nsp.groupby('date')['sumn'].sum() / nsp.groupby('date')['quantity'].sum()
        # switch to daily frequency and forward-fill missing dates
        mp = pd.DataFrame(c3.asfreq('D')).ffill().reset_index(drop=False)
        mp.columns = ['date', 'temp']
        mp['place'] = np.array([places[i]] * len(mp))
        mp['date'] = pd.to_datetime(mp['date'])
        # keep only the dates that occur for this place in dat2
        dd1 = dat2.loc[dat2['place'] == places[i], ['date']]
        mp = mp[mp['date'].isin(list(pd.to_datetime(dd1['date'])))]
        totmp = pd.concat([totmp, mp])
    return totmp

Timing by %timeit TEMP2 = filltemps(TEMP, DF) shows 116 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In TEMP2, it can be seen that, as expected, a weighted average was computed for place C:

TEMP2 = filltemps(TEMP, DF)

TEMP2 will be merged with the original DF. Note that there can be missing values if no earlier data were available.

DF = pd.merge(DF, TEMP2, how='left', on=['date', 'place'])

DF, the expected outcome, should look like this:

        date place   temp
0 2019-01-01     A    NaN
1 2019-01-11     A   5.00
2 2019-01-13     A   5.00
3 2019-01-14     A   7.00
4 2019-01-22     B  10.00
5 2019-02-14     C  10.75

Your pointers and help are very much appreciated!

Upvotes: 1

Views: 237

Answers (3)

JoergVanAken

Reputation: 1286

My solution is similar to Vaishali's, but I want to point out one pitfall of asfreq.

Let's start at the beginning. We calculate the weighted averages:

TEMP = TEMP.groupby(['date', 'place']).apply(lambda x: np.round(np.average(x['temp'], weights=x['quantity']), 2)).reset_index(name='temp')

Now we calculate the full date range:

rng = pd.date_range(DF.date.min(), DF.date.max(), freq='D')

We use this date range to reindex the temperatures:

TEMP = TEMP.groupby('place')[['date', 'temp']].apply(lambda x: x.set_index('date').reindex(rng).ffill()).reset_index().rename(columns={'level_1': 'date'})

In contrast to asfreq, this also handles the case where the temperature time series is "shorter" than the location series.
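A minimal sketch of that pitfall (hypothetical one-row series): asfreq can only generate dates between the first and last existing index values, while reindex against an externally built range can extend past them:

```python
import pandas as pd

# A temperature series with a single observation.
s = pd.Series([1.0], index=pd.to_datetime(['2019-02-14']))

# asfreq('D') stays within the series' own date span:
print(s.asfreq('D'))           # still just one row, 2019-02-14

# reindex over an externally chosen range extends past it:
rng = pd.date_range('2019-02-14', '2019-02-15', freq='D')
print(s.reindex(rng).ffill())  # two rows, 2019-02-15 forward-filled to 1.0
```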

Finally we can put everything together:

DF.merge(TEMP, on=['date', 'place'])

If we slightly modify the input by changing the last date:

DF = pd.DataFrame({'date': ['2019-01-01', '2019-01-11', '2019-01-13', '2019-01-14', '2019-01-22', '2019-02-15'], 'place':['A', 'A','A','A','B','C']})

the solution of Ruthger delivers:

        date place  temp
1 2019-01-11     A   5.0
3 2019-01-13     A   5.0
4 2019-01-14     A   7.0
0 2019-01-22     B  10.0

Vaishali:

        date place  temp
0 2019-01-01     A   NaN
1 2019-01-11     A   5.0
2 2019-01-13     A   5.0
3 2019-01-14     A   7.0
4 2019-01-22     B  10.0
5 2019-02-15     C   NaN

Joerg:

        date place   temp
0 2019-01-01     A    NaN
1 2019-01-11     A   5.00
2 2019-01-13     A   5.00
3 2019-01-14     A   7.00
4 2019-01-22     B  10.00
5 2019-02-15     C  10.75

Briefly, the timings:

%timeit Ruthger(TEMP, DF)
526 ms ± 36.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit Vaishali(TEMP, DF)
12.1 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit Joerg(TEMP, DF)
14.9 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 0

Vaishali
Vaishali

Reputation: 38415

Calculate weighted average in TEMP. Fill missing dates in TEMP by place and merge the resulting dataframe with DF.

TEMP.date = pd.to_datetime(TEMP.date) 

TEMP = TEMP.groupby(['date', 'place']).apply(lambda x: np.round(np.average(x['temp'],weights=x['quantity']), 2)).reset_index(name = 'temp')

DF.merge(TEMP.set_index('date').groupby('place')['temp'].apply(lambda x: x.asfreq('D', method='ffill')).reset_index(), 'left')

    date        place   temp
0   2019-01-01  A       NaN
1   2019-01-11  A       5.00
2   2019-01-13  A       5.00
3   2019-01-14  A       7.00
4   2019-01-22  B       10.00
5   2019-02-14  C       10.75

Upvotes: 1

P Maschhoff
P Maschhoff

Reputation: 186

Here's a replacement for filltemps that works 10x faster on my machine. It also uses apply, as requested :)

def interpolate_temps(temp_data, other_data):
    """ Interpolate temperature by place and location on a daily frequency.

    Parameters
    ----------
    temp_data : pd.DataFrame
        with columns:
            date : datetime
            place : categorical
            temp : numerical
            quantity : numerical

    other_data : pd.DataFrame
        with columns:
            date : datetime
            place : categorical

    Returns
    -------
    pd.DataFrame
        With date and place in the index and a temp column
    """
    def weighted_average(df, col1, col2):
        """Average of col1 weighted by col2"""
        return (df[col1] * df[col2]).sum() / df[col2].sum()

    results = []
    for place in temp_data.place.unique():
        dates_needed = other_data[other_data.place == place].date.unique()
        result = (temp_data[temp_data.place==place]
                  .groupby('date')
                  .apply(weighted_average, 'temp', 'quantity')
                  .reindex(dates_needed, method='ffill')
                  .to_frame('temp')
                  .assign(place=place)
                  .set_index('place', append=True)
                 )
        results.append(result)
    return pd.concat(results)

It follows the same basic structure as yours:

  1. Iterate through each place.

    1. Take the weighted average of temperature by dates.

    2. Fill in the missing values with ffill.

    3. Add place back in as a column.

  2. Join the results for each place

The major speed-up comes from down-selecting to the dates needed for a particular place before running ffill. This should also make it more memory-efficient, as should avoiding so many named intermediate objects. The biggest issue in your code was totmp = pd.concat([totmp, mp]), which creates len(places) successive versions of totmp, whereas interpolate_temps waits until the end to pd.concat(results) all at once.
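That pattern, sketched abstractly on toy frames (not the temperature data): collect the per-place pieces in a list and concatenate once at the end.

```python
import pandas as pd

pieces = []
for i in range(3):
    # build each per-place piece independently...
    pieces.append(pd.DataFrame({'place': [i], 'temp': [float(i)]}))

# ...and concatenate once, instead of growing a DataFrame inside the loop
result = pd.concat(pieces, ignore_index=True)
print(len(result))  # 3
```

Growing a DataFrame with repeated pd.concat copies all accumulated rows on every iteration; concatenating once keeps each row's data from being copied more than once.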

The result of interpolate_temps has date and place in the index, so you can join it like this:

DF.join(interpolate_temps(TEMP, DF), on=['date', 'place'])

or reset the index and merge like you had been doing.

The only other advice I have is to use more verbose variable names. I never really figured out what many of them meant, and that made it harder to understand your code.

Upvotes: 1
