storeguy

Reputation: 41

Fill missing values in time-series with duplicate values from the same time-series in python

I've got a time series of intermittent daily data like this.

import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-02','2020-01-02','2020-01-03','2020-01-04','2020-01-07','2020-01-08','2020-01-08','2020-01-10','2020-01-13','2020-01-15'], 
               'Price': [200, 324, 320, 421, 240, np.NaN, 500, 520, 531, np.NaN, 571, np.NaN]})
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date')
df

Result:

+------------+-------+
| Date       | Price |
+------------+-------+
| 2020-01-01 | 200   |
+------------+-------+
| 2020-01-02 | 324   | 
+------------+-------+
| 2020-01-02 | 320   | -- 1st duplicate for 2020-01-02
+------------+-------+
| 2020-01-02 | 421   | -- 2nd duplicate for 2020-01-02
+------------+-------+
| 2020-01-03 | 240   |
+------------+-------+
| 2020-01-04 | NaN   |
+------------+-------+
| 2020-01-07 | 500   |
+------------+-------+
| 2020-01-08 | 520   |
+------------+-------+ 
| 2020-01-08 | 531   | -- 1st duplicate for 2020-01-08 
+------------+-------+ 
| 2020-01-10 | NaN   |
+------------+-------+ 
| 2020-01-13 | 571   |
+------------+-------+ 
| 2020-01-15 | NaN   |
+------------+-------+ 

I need to fill the NaN values with prices from the nearest available date where more than one price is recorded (a duplicate).

Upvotes: 3

Views: 1101

Answers (3)

urban

Reputation: 5682

I have never been good with pandas, so I treat the frame more or less as a 2D array (there may well be more efficient ways to do this with pandas... my attempt at that is the second solution below).

So, the idea is:

  • Loop over your frame row by row
  • Always keep a pointer to the previous row so you can compare Dates (and values)
  • When a duplicate Date is found, set last_dup_row_index, which always points to the latest "movable" row (see the inline comments for edge cases)
  • While iterating, if you hit a missing Price, add a "move" to to_move_indexes, a list of tuples describing the moves that can be performed

At the end of the above loop you have all you need to modify your frame:

  1. The possible price moves
  2. The indexes you move from so you can delete those rows if you want to

The code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-02','2020-01-03','2020-01-04','2020-01-07','2020-01-08'],
               'Price': [200, 324, 320, 240, np.NaN, 500, 520]})
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date')

prev_row = None
last_dup_row_index = None
to_move_indexes = []

for index, row in df.iterrows():
    # Check if we have a None in the current row and
    # a duplicate row waiting for us
    if pd.isna(row['Price']) and last_dup_row_index is not None:
        print(f"To move price from {last_dup_row_index} to {index}")
        to_move_indexes.append((last_dup_row_index, index))


    # Check if this row and the prev one have the
    # same date
    if prev_row is not None and prev_row['Date'] == row['Date']:
        # There is a case where for the same Date you have
        # two entries out of which one is NaN. Here use the
        # other one
        if not pd.isna(row['Price']):
            print(f"Setting duplicate to: idx={index},\n{row}")
            last_dup_row_index = index
        elif not pd.isna(prev_row['Price']):
            print(f"Setting duplicate to: idx={index - 1},\n{prev_row}")
            last_dup_row_index = index - 1
        else:
            # There is an edge case where two NaNs follow each
            # other - not changing last duplicate
            print(f"Warning: two NaN @{row['Date']}")

    prev_row = row

print(to_move_indexes)

# Perform moves
for from_idx, to_idx in to_move_indexes:
    df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
print("\nFrame after moves:")
print(df)

# Perform deletes if you need to
df.drop([x for x, _ in to_move_indexes], inplace=True)
print("\nFrame after deletes:")
print(df)

And the output you get is:

# Here we detected that row index 2 is a duplicate (with index 1)
Setting duplicate to: idx=2,
Date     2020-01-02 00:00:00
Price                    320
Name: 2, dtype: object

# Here we see that row index 4 is missing Price. However
# we have the previous duplicate (2) waiting for us so we
# add a "move" as (2, 4) to our list  
To move price from 2 to 4

# The final list is
[(2, 4)]

Frame after moves:
        Date  Price
0 2020-01-01  200.0
1 2020-01-02  324.0
2 2020-01-02  320.0
3 2020-01-03  240.0
4 2020-01-04  320.0
5 2020-01-07  500.0
6 2020-01-08  520.0

Frame after deletes:
        Date  Price
0 2020-01-01  200.0
1 2020-01-02  324.0
3 2020-01-03  240.0
4 2020-01-04  320.0
5 2020-01-07  500.0
6 2020-01-08  520.0

UPDATE: Second way

# Calculate diff on dates column and keep the
# ones that are same (returns series)
dups = df.Date.diff() == "0 days"

# A cryptic (for me) way to get all the indexes
# where the value is True
dup_indexes = dups.index[dups].to_list()

# Now get the indexes where the Price is NaN
nans = pd.isnull(df).any(axis=1)
nan_indexes = nans.index[nans].to_list()

# Create moves: the nan_index should be greater than
# the dup_index but as close as possible
moves = []
for nan_index in nan_indexes:
    # dup_indexes are sorted, so take the last one that is
    # smaller than the nan_index (if there is any; this also
    # avoids treating index 0 as "no duplicate")
    earlier_dups = [x for x in dup_indexes if x < nan_index]
    if earlier_dups:
        moves.append((earlier_dups[-1], nan_index))

# Do moves and deletes
for from_idx, to_idx in moves:
    df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
df.drop([x for x, _ in moves], inplace=True)
print(df)
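
For reference, if df is re-created as at the top of this answer, this second version produces the single move (2, 4) and prints the same frame as the "Frame after deletes" output above:

        Date  Price
0 2020-01-01  200.0
1 2020-01-02  324.0
3 2020-01-03  240.0
4 2020-01-04  320.0
5 2020-01-07  500.0
6 2020-01-08  520.0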

Upvotes: 0

piterbarg

Reputation: 8219

Here is a Pandas solution, step by step

First, we group Price by Date and put the prices in a list for each date, which we then unwrap into separate columns and rename:

df2 = (
    df.groupby('Date')['Price']
    .apply(list)
    .apply(pd.Series)
    .rename(columns = {0:'Price',1:'Other'})
)
df2

so we get


           Price    Other
Date        
2020-01-01  200.0   NaN
2020-01-02  324.0   320.0
2020-01-03  240.0   NaN
2020-01-04  NaN     NaN
2020-01-07  500.0   NaN
2020-01-08  520.0   NaN

Here Price has the first price for each date, and Other the second price for that date, if available.

Now we ffill() Other, so that each second value propagates forward until the next second value is found, and so on.

df2['Other'] = df2['Other'].ffill()

so we get

            Price   Other
Date        
2020-01-01  200.0   NaN
2020-01-02  324.0   320.0
2020-01-03  240.0   320.0
2020-01-04  NaN     320.0
2020-01-07  500.0   320.0
2020-01-08  520.0   320.0

Now we can replace NaNs in the Price column with the values from the Other column, and drop Other:

df2['Price'] = df2['Price'].fillna(df2['Other'])
df2.drop(columns = ['Other'], inplace = True)
df2

to get


            Price
Date    
2020-01-01  200.0
2020-01-02  324.0
2020-01-03  240.0
2020-01-04  320.0
2020-01-07  500.0
2020-01-08  520.0
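
For convenience, the same steps can be chained into one expression; this is just a compact restatement of the above, using the same df:

df2 = (
    df.groupby('Date')['Price']
    .apply(list)
    .apply(pd.Series)
    .rename(columns={0: 'Price', 1: 'Other'})
)
# forward-fill the second prices, use them to plug holes in Price,
# then drop the helper column
df2['Price'] = df2['Price'].fillna(df2['Other'].ffill())
df2 = df2.drop(columns=['Other'])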

Upvotes: 3

ombk

Reputation: 2111

How to tackle this, and the intuition behind ffill

What you are looking for is a method called forward fill. Forward fill locates a null value and then checks whether there is a valid value before it. If so, it uses that value.

To understand how to apply the method to your data, please check the documentation of pandas fillna. It is detailed and provides examples; take a careful look at them and understand what each argument does.

Note that if the previous value is also NaN, ffill will (obviously) not change it.
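
For example, a tiny standalone illustration of that behaviour (made-up values, not the question's data):

import pandas as pd
import numpy as np

s = pd.Series([np.nan, 200.0, np.nan, 320.0, np.nan])
print(s.ffill())
# 0      NaN    <- nothing valid before it, so it stays NaN
# 1    200.0
# 2    200.0    <- filled from the previous valid value
# 3    320.0
# 4    320.0    <- filled from the last valid value seen
# dtype: float64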

Pseudo code

Since you have changed the data, I can only offer pseudo-code.

First, collect all the missing values in your table using df[df.Price.isnull()]

Then, for each missing value, check whether there are any duplicates prior to it.

If so, choose the closest duplicate and use its price; otherwise, keep it NaN.
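
A rough sketch of that pseudo-code (assuming df is the question's frame with Date kept as a regular column; the names dup_rows, nan_rows and earlier are just illustrative):

# rows whose Date repeats an earlier row, i.e. the "extra" prices
dup_rows = df.index[df['Date'].duplicated()]

# rows with a missing Price
nan_rows = df.index[df['Price'].isnull()]

for nan_idx in nan_rows:
    # duplicates that occur before this missing value
    earlier = [i for i in dup_rows if i < nan_idx]
    if earlier:
        # take the closest one; if there is none, the NaN stays as-is
        df.at[nan_idx, 'Price'] = df.at[earlier[-1], 'Price']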

Upvotes: -1
