Trying to find the nearest date before and after a specified date from a list of dates in a comma separated string within a pandas Dataframe

Question

tldr; I have an index_date in dtype: datetime64[ns] and a list_of_dates of type with individual elements in str format. What's the best way to convert these to the same data type so I can sort the dates into closest before and closest after index_date?

I have a pandas dataframe (df) with columns:

ID_string                   object
indexdate           datetime64[ns]
XR_count                     int64
CT_count                     int64
studyid_concat              object
studydate_concat            object
modality_concat             object

And it looks something like:

    ID_string   indexdate   XR_count    CT_count    studyid_concat      studydate_concat
0   55555555    2020-09-07  10          1           ['St1', 'St5'...]       ['06/22/2019', '09/20/2020'...]
1   66666666    2020-06-07  5           0           ['St11', 'St17'...]     ['05/22/2020', '06/24/2020'...]

Where the 0 element in studyid_concat ("St1") corresponds to the 0 element in studydate_concat, and in modality_concat, etc. I did not show modality_concat for space reasons, but it's something like ['XR', 'CT', ...]

My current goal is to find the closest X-ray study performed before and after my indexdate, as well as being able to rank studies from closest to furthest. I'm somewhat new to pandas, but here is my current attempt:

df = pd.read_excel(path_to_excel, sheet_name='Sheet1')

# Convert comma separated string from Excel to lists of strings
df.studyid_concat = df.studyid_concat.str.split(',')
df.studydate_concat = df.studydate_concat.str.split(',')
df.modality_concat = df.modality_concat.str.split(',')

for x in in df['ID_string'].values:
    index_date = df.loc[df['ID_string'] == x, 'indexdate']

    # Had to use subscript [0] below because result of above was a list in an array
    studyid_list = df.loc[df['ID_string'] == x, 'studyid_concat'].values[0]
    date_list = df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]
    modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]

    xr_date_list = [date_list[x] for x in range(len(date_list)) if modality_list[x]=="XR"]
    xr_studyid_list = [studyid_list[x] for x in range(len(studyid_list)) if modality_list[x]=="XR"]

That's about as far as I got because I'm somewhat confused on datatypes here. My indexdate is currently in dtype: datetime64[ns] which I was thinking of converting using the datetime module, but was having a hard time figuring out how. I also wasn't sure if I needed to. My xr_study_list is a list of strings containing dates in format 'mm/dd/yyyy'. I think if I could figure out the rest if I could get the data types in the right format. I'd just compare if the dates are >= or < indexdate to sort into before/after, and then subtract each date by indexdate and sort. I think whatever I do with my xr_date_list, I'd just have to be sure to do the same with xr_studyid_list to keep track of the unique study id

Edit: Desired output dataframe would look like

    ID_string   indexdate   StudyIDBefore           StudyDateBefore     
0   55555555    2020-09-07  ['St33', 'St1', ...]    [2020-09-06, 2019-06-22, ...]
1   66666666    2020-06-07  ['St11', 'St2', ...]    [2020-05-22, 2020-05-01, ...]

Where the "before" variables would be sorted from nearest to furthest, and similar "after columns would exist. My current goal is just to check if a study exists within 3 days before and after this indexdate, but having the above dataframe would give me the flexibility if I need to start looking beyond the nearest study.

Trying to find the nearest date before and after a specified date from a list of dates in a comma separated string within a pandas Dataframe

Answers (1)

Related Questions