Reputation: 859
All,
I am looking for some help with the following problem. I have a way to achieve the desired result, but it requires a loop. Here is the problem:
import pandas as pd
import numpy as np
# Assumptions:
# 1. The value at minimum index is never np.nan. I have a separate piece of logic that handles it
df = pd.DataFrame(np.random.randint(0,100,size=(15, 1)), columns=list('A'))
# Indices to null
random_indices = np.random.permutation(np.arange(1, 14))[:5]
random_indices = np.sort(random_indices)
df.loc[random_indices, 'A'] = np.nan
df1, df2 = df.copy(deep=True), df.copy(deep=True)
# Approach 1
df1 = df1.fillna(method='ffill')
# Approach 2
for i in random_indices:
    df2.loc[i, 'A'] = df2.loc[i-1, 'A'] + 0.1
print(df1)
print(df2)
Please note that the value at index 0 is never np.nan and is handled separately. Approach 2 gives the desired result, but requires a loop. I would like to achieve the same result using Approach 1 or a similar function. Any help is appreciated.
Upvotes: 0
Views: 258
Reputation: 30050
df1 = df1['A'].ffill() + df1.groupby(df1['A'].ffill()).cumcount()/10
Let's elaborate on what df1.groupby(df1['A'].ffill()).cumcount()/10 does.
Take the following column A as an example:
1
NaN
2
NaN
NaN
3
df1['A'].ffill()
would be
1
1
2
2
2
3
In this part, if the column can contain duplicate values, use df1['A'].notnull().cumsum() as the grouping key instead of df1['A'].ffill(). .notnull().cumsum() starts a new group at every non-NaN value, so each value and the NaNs that follow it form their own group. .ffill() groups by the filled value itself, so two separate runs that happen to fill with the same value are merged into one group and the counter keeps increasing across them. The sketch below illustrates the difference.
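Here is a minimal sketch of that difference, using a hypothetical column in which the value 2 appears in two separate runs:
import pandas as pd
import numpy as np
# Hypothetical example: the non-NaN value 2 occurs in two separate runs
s = pd.Series([1, np.nan, 2, np.nan, 2, np.nan, 3], name='A')
# Grouping key from ffill(): both runs of 2 land in the same group,
# so the counter keeps growing across them
print(s.ffill().tolist())                                   # [1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0]
print(s.groupby(s.ffill()).cumcount().tolist())             # [0, 1, 0, 1, 2, 3, 0]
# Grouping key from notnull().cumsum(): every non-NaN value starts a fresh group
print(s.notnull().cumsum().tolist())                        # [1, 1, 2, 2, 3, 3, 4]
print(s.groupby(s.notnull().cumsum()).cumcount().tolist())  # [0, 1, 0, 1, 0, 1, 0]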
pandas.DataFrame.groupby() can take a Series to determine the groups. Using the result of df1['A'].ffill() as the key, the 1st and 2nd rows fall into one group, the 3rd, 4th and 5th rows into another, and the 6th row into a third.
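As a quick sanity check (a small sketch on the same example values), the group label of every row can be inspected with ngroup():
import pandas as pd
import numpy as np
a = pd.Series([1, np.nan, 2, np.nan, np.nan, 3], name='A')
# ngroup() labels each row with the number of the group it belongs to
print(a.groupby(a.ffill()).ngroup().tolist())   # [0, 0, 1, 1, 1, 2]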
pandas.core.groupby.GroupBy.cumcount() numbers each item in each group from 0 to the length of that group - 1. For this example, it gives:
0
1
0
1
2
0
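Dividing that counter by 10 turns it into the 0.1 increments, and adding it to the forward-filled values gives the desired result. A minimal end-to-end sketch on the same example column:
import pandas as pd
import numpy as np
a = pd.Series([1, np.nan, 2, np.nan, np.nan, 3], name='A')
filled = a.ffill()                              # 1, 1, 2, 2, 2, 3
offset = a.groupby(a.ffill()).cumcount() / 10   # 0.0, 0.1, 0.0, 0.1, 0.2, 0.0
print((filled + offset).tolist())               # [1.0, 1.1, 2.0, 2.1, 2.2, 3.0]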
Full program
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(15, 1)), columns=list('A'))
# Null out 5 random interior indices (index 0 stays non-NaN)
random_indices = np.random.permutation(np.arange(1, 14))[:5]
random_indices = np.sort(random_indices)
df.loc[random_indices, 'A'] = np.nan
df1, df2 = df.copy(deep=True), df.copy(deep=True)
# Fill each NaN with the previous non-NaN value, plus 0.1 per consecutive NaN
df1 = df1['A'].ffill() + df1.groupby(df1['A'].ffill()).cumcount()/10
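Note that the last line replaces df1 (a DataFrame) with a Series. If you would rather keep the DataFrame and only update the column, that line can be written instead as (a sketch of the same idea):
df1['A'] = df1['A'].ffill() + df1.groupby(df1['A'].ffill()).cumcount()/10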
Upvotes: 2