Reputation: 27
I have a dataframe as follows:
ptid,blast_date,test_name,result_date,test_result,date_diff
PT085087309 2013-10-03 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-10-04 Influenza A.H1.respiratory.qualitative 2013-10-04 not detected 0
PT085087309 2013-10-07 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 2
PT085087309 2013-10-09 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 4
PT085087309 2013-10-14 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 9
PT085087309 2013-10-15 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 10
PT085087309 2013-10-18 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 13
PT085087309 2013-10-21 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-10-23 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-10-24 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-10-25 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-10-27 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-10-28 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-10-31 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-11-01 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-11-04 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-11-06 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-11-08 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-11-11 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-11-14 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-11-15 Influenza A.H1.respiratory.qualitative 2013-11-15 not detected 0
PT085087309 2013-11-18 Influenza A.H1.respiratory.qualitative 2013-11-15 not detected 3
PT085087309 2013-11-19 Influenza A.H1.respiratory.qualitative 2013-11-15 not detected 4
PT085087309 2013-11-21 Influenza A.H1.respiratory.qualitative 2013-11-15 not detected 6
PT085087309 2014-09-29 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 8
PT085087309 2014-09-30 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 9
PT085087309 2014-10-01 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 10
PT085087309 2014-10-02 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 11
PT085087309 2014-10-03 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 12
PT085087309 2014-10-04 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 13
PT085087309 2014-10-06 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2014-10-07 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2014-10-09 Influenza A.H1.respiratory.qualitative NA NA 0
I want the above dataframe to check the previous rows where ever result and result_date column is NA and get the data from the previous rows nearest "date" wise and populate it instead of NA.
The resulting dataframe:
ptid,blast_date,test_name,result_date,test_result,date_diff
PT085087309 2013-10-03 Influenza A.H1.respiratory.qualitative NA NA 0
PT085087309 2013-10-04 Influenza A.H1.respiratory.qualitative 2013-10-04 not detected 0
PT085087309 2013-10-07 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 2
PT085087309 2013-10-09 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 4
PT085087309 2013-10-14 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 9
PT085087309 2013-10-15 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 10
PT085087309 2013-10-18 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 13
PT085087309 2013-10-21 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-10-23 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-10-24 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-10-25 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-10-27 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-10-28 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-10-31 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-11-01 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-11-04 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-11-06 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-11-08 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-11-11 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-11-14 Influenza A.H1.respiratory.qualitative 2013-10-05 not detected 0
PT085087309 2013-11-15 Influenza A.H1.respiratory.qualitative 2013-11-15 not detected 0
PT085087309 2013-11-18 Influenza A.H1.respiratory.qualitative 2013-11-15 not detected 3
PT085087309 2013-11-19 Influenza A.H1.respiratory.qualitative 2013-11-15 not detected 4
PT085087309 2013-11-21 Influenza A.H1.respiratory.qualitative 2013-11-15 not detected 6
PT085087309 2014-09-29 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 8
PT085087309 2014-09-30 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 9
PT085087309 2014-10-01 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 10
PT085087309 2014-10-02 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 11
PT085087309 2014-10-03 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 12
PT085087309 2014-10-04 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 13
PT085087309 2014-10-06 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 0
PT085087309 2014-10-07 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 0
PT085087309 2014-10-09 Influenza A.H1.respiratory.qualitative 2014-09-21 not detected 0
I have the following code but I get the result data from the earliest date. Can you guys please advise me how to change the code below so as to get the result data from the nearest date row.
w=Window().partitionBy("ptid","test_name").orderBy("blast_date")
df_tests_filled = df_all_tests.withColumn("collect", f.collect_list(f.array("result_date","test_result")).over(w))\
.withColumn("collect", f.expr("""filter(collect,x-> array_contains(x,'NA')!=True)""")[0])\
.withColumn("result_date", f.when((f.col("result_date")=='NA')&(f.col("collect").isNotNull()),f.col("collect")[0]).otherwise(f.col("result_date")))\
.withColumn("test_result", f.when((f.col("test_result")=='NA')&(f.col("collect").isNotNull()),f.col("collect")[1]).otherwise(f.col("test_result"))).drop("timestamp","collect")
Upvotes: 0
Views: 64
Reputation: 86
I think this is what you are looking for:
df['col'] = df['col'].fillna(method='ffill')
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
Upvotes: 1