Reputation: 41
I've got a time series of intermittent daily data like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-02','2020-01-02','2020-01-03','2020-01-04','2020-01-07','2020-01-08','2020-01-08','2020-01-10','2020-01-13','2020-01-15'],
'Price': [200, 324, 320, 421, 240, np.NaN, 500, 520, 531, np.NaN, 571, np.NaN]})
df['Date']= pd.to_datetime(df['Date'])
df.set_index('Date')
df
Result:
+------------+-------+
| Date | Price |
+------------+-------+
| 2020-01-01 | 200 |
+------------+-------+
| 2020-01-02 | 324 |
+------------+-------+
| 2020-01-02 | 320 | -- 1st duplicate for 2020-01-02
+------------+-------+
| 2020-01-02 | 421 | -- 2nd duplicate for 2020-01-02
+------------+-------+
| 2020-01-03 | 240 |
+------------+-------+
| 2020-01-04 | NaN |
+------------+-------+
| 2020-01-07 | 500 |
+------------+-------+
| 2020-01-08 | 520 |
+------------+-------+
| 2020-01-08 | 531 | -- 1st duplicate for 2020-01-08
+------------+-------+
| 2020-01-10 | NaN |
+------------+-------+
| 2020-01-13 | 571 |
+------------+-------+
| 2020-01-15 | NaN |
+------------+-------+
I need to fill the NaN values with the price from the nearest available date where more than one price is recorded (a duplicate).
Upvotes: 3
Views: 1101
Reputation: 5682
I have never been good with pandas, so I treat the frame more or less as a 2D array (there may well be more efficient ways to do this with pandas; my attempt at that is the second solution).
So, the idea is to iterate over the rows while keeping track of:
- last_dup_row_index, which always points to the latest "movable" row (see inline comments for edge cases)
- to_move_indexes, a list of tuples describing the moves that can be performed
At the end of the above loop you have all you need to modify your frame.
The code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02', '2020-01-02','2020-01-03','2020-01-04','2020-01-07','2020-01-08'],
'Price': [200, 324, 320, 240, np.NaN, 500, 520]})
df['Date']= pd.to_datetime(df['Date'])
df.set_index('Date')
prev_row = None
last_dup_row_index = None
to_move_indexes = []
for index, row in df.iterrows():
    # Check if we have a NaN in the current row and
    # a duplicate row waiting for us
    if pd.isna(row['Price']) and last_dup_row_index is not None:
        print(f"To move price from {last_dup_row_index} to {index}")
        to_move_indexes.append((last_dup_row_index, index))
    # Check if this row and the previous one have the
    # same date
    if prev_row is not None and prev_row['Date'] == row['Date']:
        # There is a case where for the same Date you have
        # two entries out of which one is NaN. Here use the
        # other one
        if not pd.isna(row['Price']):
            print(f"Setting duplicate to: idx={index},\n{row}")
            last_dup_row_index = index
        elif not pd.isna(prev_row['Price']):
            print(f"Setting duplicate to: idx={index - 1},\n{prev_row}")
            last_dup_row_index = index - 1
        else:
            # There is an edge case where two NaNs follow each
            # other - not changing last duplicate
            print(f"Warning: two NaN @{row['Date']}")
    prev_row = row
print(to_move_indexes)
# Perform moves
for from_idx, to_idx in to_move_indexes:
    df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
print("\nFrame after moves:")
print(df)
# Perform deletes if you need to
df.drop([x for x, _ in to_move_indexes], inplace=True)
print("\nFrame after deletes:")
print(df)
And the output you get is:
# Here we detected that row index 2 is a duplicate (with index 1)
Setting duplicate to: idx=2,
Date 2020-01-02 00:00:00
Price 320
Name: 2, dtype: object
# Here we see that row index 4 is missing Price. However
# we have the previous duplicate (2) waiting for us so we
# add a "move" as (2, 4) to our list
To move price from 2 to 4
# The final list is
[(2, 4)]
Frame after moves:
Date Price
0 2020-01-01 200.0
1 2020-01-02 324.0
2 2020-01-02 320.0
3 2020-01-03 240.0
4 2020-01-04 320.0
5 2020-01-07 500.0
6 2020-01-08 520.0
Frame after deletes:
Date Price
0 2020-01-01 200.0
1 2020-01-02 324.0
3 2020-01-03 240.0
4 2020-01-04 320.0
5 2020-01-07 500.0
6 2020-01-08 520.0
UPDATE: Second way
# Calculate diff on dates column and keep the
# ones that are same (returns series)
dups = df.Date.diff() == "0 days"
# A cryptic (for me) way to get all the indexes
# where the value is True
dup_indexes = dups.index[dups].to_list()
# Now get the indexes where the Price is NaN
nans = pd.isnull(df).any(axis=1)
nan_indexes = nans.index[nans].to_list()
# Create moves: the nan_index should be greater than
# the dup_index but as close as possible
moves = []
for nan_index in nan_indexes:
    # dup_indexes are sorted, so take the last one that is
    # smaller than the nan_index (guarding against the case
    # where there is no earlier duplicate at all)
    earlier_dups = [x for x in dup_indexes if x < nan_index]
    if earlier_dups:
        moves.append((earlier_dups[-1], nan_index))
# Do moves and deletes
for from_idx, to_idx in moves:
    df.at[to_idx, 'Price'] = df.at[from_idx, 'Price']
df.drop([x for x, _ in moves], inplace=True)
print(df)
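Side note: an equivalent and arguably more readable way to collect those indexes is to index the frame's index with a boolean mask directly. A small sketch, assuming the same df as above (default integer index, duplicate dates adjacent as in the sample):
# Rows whose Date repeats an earlier date (same result as the
# diff() == "0 days" check for the sample data)
dup_indexes = df.index[df['Date'].duplicated()].to_list()
# Rows whose Price is NaN
nan_indexes = df.index[df['Price'].isna()].to_list()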
Upvotes: 0
Reputation: 8219
Here is a Pandas solution, step by step.
First, we group Price by Date and collect the prices for each date into a list, which we then unwrap into separate columns and rename:
df2 = (
    df.groupby('Date')['Price']
      .apply(list)
      .apply(pd.Series)
      .rename(columns={0: 'Price', 1: 'Other'})
)
df2
so we get
Price Other
Date
2020-01-01 200.0 NaN
2020-01-02 324.0 320.0
2020-01-03 240.0 NaN
2020-01-04 NaN NaN
2020-01-07 500.0 NaN
2020-01-08 520.0 NaN
Here Price has the first price for that date, and Other the second price for that date, if available.
Now we ffill() Other, so that second prices are propagated forward until the next second price is found, and so on:
df2['Other'] = df2['Other'].ffill()
so we get
Price Other
Date
2020-01-01 200.0 NaN
2020-01-02 324.0 320.0
2020-01-03 240.0 320.0
2020-01-04 NaN 320.0
2020-01-07 500.0 320.0
2020-01-08 520.0 320.0
Now we can replace the NaNs in the Price column with the values from the Other column, and drop Other:
df2['Price'] = df2['Price'].fillna(df2['Other'])
df2.drop(columns = ['Other'], inplace = True)
df2
to get
Price
Date
2020-01-01 200.0
2020-01-02 324.0
2020-01-03 240.0
2020-01-04 320.0
2020-01-07 500.0
2020-01-08 520.0
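For reference, the same steps can be chained into a single expression. A sketch under the same assumptions (at least one date has a second price, so the Other column exists after unwrapping):
df2 = (
    df.groupby('Date')['Price']
      .apply(list)
      .apply(pd.Series)
      .rename(columns={0: 'Price', 1: 'Other'})
      .assign(Other=lambda d: d['Other'].ffill())             # propagate second prices forward
      .assign(Price=lambda d: d['Price'].fillna(d['Other']))  # fill missing prices from Other
      .drop(columns=['Other'])
)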
Upvotes: 3
Reputation: 2111
How to tackle the problem, and the intuition behind ffill
What you are looking for is a method called forward fill. Forward fill locates a null value and checks whether there is a valid value before it; if so, it uses that value.
To understand how to apply the method to your data, please check the documentation of pandas fillna
here. It is detailed and provides examples; take a careful look at them and understand what each argument does.
Note that if every value before a NaN is also NaN (for example at the start of the series), ffill won't change it.
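For illustration, a minimal sketch of that behaviour on a made-up series (the values are not from the question's data):
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 200.0, np.nan, np.nan, 500.0])
print(s.ffill())
# 0      NaN   <- no valid value before it, so it stays NaN
# 1    200.0
# 2    200.0   <- filled forward from 200
# 3    200.0   <- ffill propagates through consecutive NaNs
# 4    500.0
# dtype: float64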
Pseudo code
Since you have changed the data, I can offer some pseudo-code (sketched below).
First, collect all the rows with missing data in your table using df[df.Price.isnull()].
Then, for each missing value, check whether there are duplicates prior to it.
If so, choose the closest duplicate and use its price to replace the NaN; otherwise, keep it NaN.
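A rough sketch of that pseudo-code, assuming a frame like the simplified one above (default integer index, dates sorted); the variable names are just illustrative:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-02',
                                           '2020-01-03', '2020-01-04', '2020-01-07', '2020-01-08']),
                   'Price': [200, 324, 320, 240, np.nan, 500, 520]})

# 1. Collect the positions of the missing prices
missing = df.index[df['Price'].isnull()]

# 2. Positions of duplicate dates (second and later occurrences of a date)
dups = df.index[df['Date'].duplicated()]

# 3. For each missing price, use the closest earlier duplicate, if any;
#    otherwise the value stays NaN
for i in missing:
    earlier = [d for d in dups if d < i]
    if earlier:
        df.at[i, 'Price'] = df.at[earlier[-1], 'Price']

print(df)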
Upvotes: -1