Trim Group by Column/Series Sequence in Pandas by NaN Ocurrence

Question

I have a data frame as follows:

user_id	metric_date	metric_val1	is_churn
3	2021-01	NaN	True
3	2021-02	NaN	True
3	2021-03	0.4	False
3	2021-04	0.5	False
3	2021-05	NaN	True
4	2021-01	0.1	False
4	2021-02	0.3	False
4	2021-03	0.2	False
4	2021-04	NaN	True
4	2021-05	NaN	True

Suppose there are other metric columns, but the main reference is metric_val1, how can I grouping by user_id and trim all row that have NaN value before the first valid metric_val1, and keeping only the last NaN after the last valid value for metric_val1, the output should be something like that (Assume that there is no gap in the valid values) :

user_id	metric_date	metric_val1	is_churn
3	2021-03	0.4	False
3	2021-04	0.5	False
3	2021-05	NaN	True
4	2021-01	0.1	False
4	2021-02	0.3	False
4	2021-03	0.2	False
4	2021-04	NaN	True

Can someone help me with an efficient way to do that in pandas?

wwnde · Accepted Answer

Please boolean select all non Non values or NaN values which immediately follow non Nan values in a group and mask. Code below;

df[df.groupby('user_id')['metric_val1'].apply(lambda x : x.notna()|x.isna()&x.shift(1).notna())]



    user_id metric_date  metric_val1  is_churn
2        3     2021-03          0.4     False
3        3     2021-04          0.5     False
4        3     2021-05          NaN      True
5        4     2021-01          0.1     False
6        4     2021-02          0.3     False
7        4     2021-03          0.2     False
8        4     2021-04          NaN      True

If you have a large dataframe and are worried of memory and speed. Could try use pyspark. Just instantiate a pyspark session. Pyspark is scalable;

from  pyspark.sql.functions import *
import pyspark.sql.functions as F
from pyspark.sql import Window
k =Window.partitionBy('user_id').orderBy('user_id','metric_date')
(
  df.withColumn('t', lag('metric_val1').over(k))#Introduce column t which draws immediate preceding columns' value
  .filter((F.col('t')=='NaN')|(F.col('metric_val1')!='NaN'))#Filter out t is NaN or metric_val1 is not NaN
  .drop('t')#drop the temp column
).show()

+-------+-----------+-----------+--------+
|user_id|metric_date|metric_val1|is_churn|
+-------+-----------+-----------+--------+
|      3|    2021-02|        NaN|    true|
|      3|    2021-03|        0.4|   false|
|      3|    2021-04|        0.5|   false|
|      4|    2021-01|        0.1|   false|
|      4|    2021-02|        0.3|   false|
|      4|    2021-03|        0.2|   false|
|      4|    2021-05|        NaN|    true|
+-------+-----------+-----------+--------+

Trim Group by Column/Series Sequence in Pandas by NaN Ocurrence

Answers (1)

Related Questions