Replace certain fields in dataframe based on conditions

Question

I have a dataframe as follows. The following is for just 1 patient and 1 particular test.It can have multiple other tests with similar look.

ptid,blast_date,test_name,result_date,test_result,date_diff
PT381201021,2019-08-22,Albumin,2019-08-14,4.3,8
PT381201021,2019-05-17,Albumin,NA,NA,0
PT381201021,2019-05-18,Albumin,NA,NA,0
PT381201021,2019-05-21,Albumin,NA,NA,0
PT381201021,2019-05-23,Albumin,NA,NA,0
PT381201021,2019-05-16,Albumin,NA,NA,0
PT381201021,2019-05-19,Albumin,NA,NA,0
PT381201021,2019-05-22,Albumin,NA,NA,0
PT381201021,2019-05-20,Albumin,NA,NA,0

I want the result_date, test_result for "Albumin" in this case to be populated from a previous blast_date if it is under certain threshold lets assume 3 months in this case. So I want the following row to be populated as follows:

PT381201021,2019-05-23,Albumin,2019-08-14,4.3,0

You can leave the date_diff colm as it is.

So the final dataframe expected as follows:-

ptid,blast_date,test_name,result_date,test_result,date_diff
PT381201021,2019-08-22,Albumin,2019-08-14,4.3,8
PT381201021,2019-05-17,Albumin,NA,NA,0
PT381201021,2019-05-18,Albumin,NA,NA,0
PT381201021,2019-05-21,Albumin,NA,NA,0
PT381201021,2019-05-23,Albumin,2019-08-14,4.3,0
PT381201021,2019-05-16,Albumin,NA,NA,0
PT381201021,2019-05-19,Albumin,NA,NA,0
PT381201021,2019-05-22,Albumin,NA,NA,0
PT381201021,2019-05-20,Albumin,NA,NA,0

I tried to use the lag function but have some difficulties in that. Looking for a pyspark way to solve this.

murtihash · Accepted Answer

You should use window functions, with rangeBetween on seconds.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w=Window().partitionBy("ptid","test_name").orderBy(F.to_timestamp("blast_date","yyyy-MM-dd").cast("long")).rangeBetween(Window.currentRow,86400*91)

df.withColumn("collect", F.collect_list(F.array("result_date","test_result")).over(w))\
  .withColumn("collect", F.expr("""filter(collect,x-> array_contains(x,'NA')!=True)""")[0])\
  .withColumn("result_date", F.when((F.col("result_date")=='NA')&(F.col("collect").isNotNull()),F.col("collect")[0]).otherwise(F.col("result_date")))\
  .withColumn("test_result", F.when((F.col("test_result")=='NA')&(F.col("collect").isNotNull()),F.col("collect")[1]).otherwise(F.col("test_result"))).drop("timestamp","collect").show(truncate=False)

+-----------+----------+---------+-----------+-----------+---------+
|ptid       |blast_date|test_name|result_date|test_result|date_diff|
+-----------+----------+---------+-----------+-----------+---------+
|PT381201021|2019-05-16|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-17|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-18|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-19|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-20|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-21|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-22|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-23|Albumin  |2019-08-14 |4.3        |0        |
|PT381201021|2019-08-22|Albumin  |2019-08-14 |4.3        |8        |
+-----------+----------+---------+-----------+-----------+---------+

Replace certain fields in dataframe based on conditions

Answers (2)

Related Questions