Anirban Mishra
Anirban Mishra

Reputation: 27

Get data from nearest row instead of earliest row

I have a dataframe as follows:

ptid,blast_date,test_name,result_date,test_result,date_diff
PT085087309 2013-10-03  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-10-04  Influenza A.H1.respiratory.qualitative  2013-10-04  not detected    0
PT085087309 2013-10-07  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    2
PT085087309 2013-10-09  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    4
PT085087309 2013-10-14  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    9
PT085087309 2013-10-15  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    10
PT085087309 2013-10-18  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    13
PT085087309 2013-10-21  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-10-23  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-10-24  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-10-25  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-10-27  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-10-28  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-10-31  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-11-01  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-11-04  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-11-06  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-11-08  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-11-11  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-11-14  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-11-15  Influenza A.H1.respiratory.qualitative  2013-11-15  not detected    0
PT085087309 2013-11-18  Influenza A.H1.respiratory.qualitative  2013-11-15  not detected    3
PT085087309 2013-11-19  Influenza A.H1.respiratory.qualitative  2013-11-15  not detected    4
PT085087309 2013-11-21  Influenza A.H1.respiratory.qualitative  2013-11-15  not detected    6
PT085087309 2014-09-29  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    8
PT085087309 2014-09-30  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    9
PT085087309 2014-10-01  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    10
PT085087309 2014-10-02  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    11
PT085087309 2014-10-03  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    12
PT085087309 2014-10-04  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    13
PT085087309 2014-10-06  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2014-10-07  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2014-10-09  Influenza A.H1.respiratory.qualitative  NA  NA  0

I want the above dataframe to check the previous rows where ever result and result_date column is NA and get the data from the previous rows nearest "date" wise and populate it instead of NA.

The resulting dataframe:

ptid,blast_date,test_name,result_date,test_result,date_diff
PT085087309 2013-10-03  Influenza A.H1.respiratory.qualitative  NA  NA  0
PT085087309 2013-10-04  Influenza A.H1.respiratory.qualitative  2013-10-04  not detected    0
PT085087309 2013-10-07  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    2
PT085087309 2013-10-09  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    4
PT085087309 2013-10-14  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    9
PT085087309 2013-10-15  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    10
PT085087309 2013-10-18  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    13
PT085087309 2013-10-21  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-10-23  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-10-24  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-10-25  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-10-27  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-10-28  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-10-31  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-11-01  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-11-04  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-11-06  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-11-08  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-11-11  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-11-14  Influenza A.H1.respiratory.qualitative  2013-10-05  not detected    0
PT085087309 2013-11-15  Influenza A.H1.respiratory.qualitative  2013-11-15  not detected    0
PT085087309 2013-11-18  Influenza A.H1.respiratory.qualitative  2013-11-15  not detected    3
PT085087309 2013-11-19  Influenza A.H1.respiratory.qualitative  2013-11-15  not detected    4
PT085087309 2013-11-21  Influenza A.H1.respiratory.qualitative  2013-11-15  not detected    6
PT085087309 2014-09-29  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    8
PT085087309 2014-09-30  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    9
PT085087309 2014-10-01  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    10
PT085087309 2014-10-02  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    11
PT085087309 2014-10-03  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    12
PT085087309 2014-10-04  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    13
PT085087309 2014-10-06  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    0
PT085087309 2014-10-07  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    0
PT085087309 2014-10-09  Influenza A.H1.respiratory.qualitative  2014-09-21  not detected    0

I have the following code but I get the result data from the earliest date. Can you guys please advise me how to change the code below so as to get the result data from the nearest date row.

w=Window().partitionBy("ptid","test_name").orderBy("blast_date")

df_tests_filled = df_all_tests.withColumn("collect", f.collect_list(f.array("result_date","test_result")).over(w))\
  .withColumn("collect", f.expr("""filter(collect,x-> array_contains(x,'NA')!=True)""")[0])\
  .withColumn("result_date", f.when((f.col("result_date")=='NA')&(f.col("collect").isNotNull()),f.col("collect")[0]).otherwise(f.col("result_date")))\
  .withColumn("test_result", f.when((f.col("test_result")=='NA')&(f.col("collect").isNotNull()),f.col("collect")[1]).otherwise(f.col("test_result"))).drop("timestamp","collect")

Upvotes: 0

Views: 64

Answers (1)

usher
usher

Reputation: 86

I think this is what you are looking for:

df['col'] = df['col'].fillna(method='ffill')

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

Upvotes: 1

Related Questions