Reputation: 5555
Let's say that I have the following data-frame:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3], "date": [pd.Timestamp(2002, 2, 2), pd.Timestamp(2003, 3, 3), pd.Timestamp(2004, 4, 4), pd.Timestamp(2005, 5, 5), pd.Timestamp(2006, 6, 6), pd.Timestamp(2007, 7, 7), pd.Timestamp(2008, 8, 8), pd.Timestamp(2009, 9, 9), pd.Timestamp(2010, 10, 10), pd.Timestamp(2011, 11, 11)], "numeric": [0.9, 0.4, 0.2, 0.6, np.nan, 0.8, 0.7, np.nan, np.nan, 0.5], "nominal": [0, 1, 0, 1, 0, 0, 0, 1, 1, 1]})
What I want to achieve is to strip rows at the end of each group (the rows are grouped by id): rows are removed from the end until a non-NaN value appears in the numeric column. Additionally, the last row of each group always has a non-NaN value in the numeric column, and that last row should always be removed. So, the resulting data-frame is:
result_df = pd.DataFrame({"id": [1, 1, 2, 3], "date": [pd.Timestamp(2002, 2, 2), pd.Timestamp(2003, 3, 3), pd.Timestamp(2005, 5, 5), pd.Timestamp(2008, 8, 8)], "numeric": [0.9, 0.4, 0.6, 0.7], "nominal": [0, 1, 1, 0]})
More explanation on how we get to the resulting data-frame:
id == 1: only the last row is removed, since the row before the last one has a value in the numeric column.
id == 2: the last two rows are removed, because the last row is removed by default and the row before it has a NaN value.
id == 3: the last three rows are removed, because the last row is removed by default and the first non-NaN value is on the fourth row counting from the bottom.
Moreover, what I am currently doing is:
df.groupby("id", as_index=False).apply(lambda x: x.iloc[:-1]).reset_index(drop=True)
However, this only removes the last row of each group, and I want to remove the last N rows based on the condition explained above.
Please let me know if you need any further information and looking forward to your answers!
Upvotes: 1
Views: 115
Reputation: 4004
For the specific example you have posted just dropping the NaNs before grouping does the trick:
df = df.dropna().groupby('id').apply(lambda x: x.iloc[:-1]).reset_index(drop=True)
df
Out[58]:
   id       date  numeric  nominal
0   1 2002-02-02      0.9        0
1   1 2003-03-03      0.4        1
2   2 2005-05-05      0.6        1
3   3 2008-08-08      0.7        0
If you have non-contiguous NaNs and want to remove only the last (trailing) block of NaNs in each group:
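import numpy as np
import pandas as pd

def strip_rows(X):
    # Always drop the last row of the group.
    X = X.iloc[:-1, :]
    # Keep dropping trailing rows while "numeric" is NaN; stop if the group is empty.
    while len(X) > 0 and pd.isna(X["numeric"].iloc[-1]):
        X = X.iloc[:-1, :]
    return X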
df_1 = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3],
                     "date": [pd.Timestamp(2002, 2, 2),
                              pd.Timestamp(2003, 3, 3),
                              pd.Timestamp(2004, 4, 4),
                              pd.Timestamp(2005, 5, 5),
                              pd.Timestamp(2006, 6, 6),
                              pd.Timestamp(2007, 7, 7),
                              pd.Timestamp(2008, 8, 8),
                              pd.Timestamp(2009, 9, 9),
                              pd.Timestamp(2010, 10, 10),
                              pd.Timestamp(2011, 11, 11),
                              pd.Timestamp(2011, 12, 12),
                              pd.Timestamp(2012, 1, 1)],
                     "numeric": [0.9, 0.4, 0.2, 0.6, np.nan, 0.8, 0.7, np.nan, np.nan, 0.5, np.nan, 0.3],
                     "nominal": [0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1]})
df_2 = df_1.groupby('id').apply(strip_rows).reset_index(drop=True)
df_1
Out[151]:
    id       date  numeric  nominal
0    1 2002-02-02      0.9        0
1    1 2003-03-03      0.4        1
2    1 2004-04-04      0.2        0
3    2 2005-05-05      0.6        1
4    2 2006-06-06      NaN        0
5    2 2007-07-07      0.8        0
6    3 2008-08-08      0.7        0
7    3 2009-09-09      NaN        1
8    3 2010-10-10      NaN        1
9    3 2011-11-11      0.5        1
10   3 2011-12-12      NaN        0
11   3 2012-01-01      0.3        1
df_2
Out[152]:
   id       date  numeric  nominal
0   1 2002-02-02      0.9        0
1   1 2003-03-03      0.4        1
2   2 2005-05-05      0.6        1
3   3 2008-08-08      0.7        0
4   3 2009-09-09      NaN        1
5   3 2010-10-10      NaN        1
6   3 2011-11-11      0.5        1
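As a possible alternative to the row-by-row loop, the same rule (drop the last row of each group, then any trailing NaN rows) can be sketched with a boolean mask instead of groupby.apply. This is only a sketch: the function name strip_trailing and its key/col parameters are made up for illustration, and it assumes the rows of each group are already in the order you care about. It is shown on the id and numeric columns of the question's data.

```python
import numpy as np
import pandas as pd


def strip_trailing(df, key="id", col="numeric"):
    """Drop the last row of each group, then any trailing NaN rows in `col`.

    Hypothetical helper, not part of pandas.
    """
    # True only on the last row of each group (position 0 counting from the end).
    is_last = df.groupby(key).cumcount(ascending=False) == 0
    # A row can anchor the kept part if it is non-NaN and is not the last row.
    anchor = (df[col].notna() & ~is_last).astype(int)
    # Walk each group backwards: keep a row if an anchor sits at or after it.
    keep = anchor[::-1].groupby(df[key][::-1]).cummax()[::-1].astype(bool)
    return df[keep].reset_index(drop=True)


# Same ids and numeric values as in the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "numeric": [0.9, 0.4, 0.2, 0.6, np.nan, 0.8, 0.7, np.nan, np.nan, 0.5],
})
result = strip_trailing(df)
```

Because the mask only looks for a non-NaN value at or after each row, NaNs in the middle of a group are kept, which matches what strip_rows does for df_1 above.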
Upvotes: 2