anbk
anbk

Reputation: 93

Trying to find the nearest date before and after a specified date from a list of dates in a comma separated string within a pandas Dataframe

tldr; I have an index_date in dtype: datetime64[ns] <class 'pandas.core.series.Series'> and a list_of_dates of type <class 'list'> with individual elements in str format. What's the best way to convert these to the same data type so I can sort the dates into closest before and closest after index_date?

I have a pandas dataframe (df) with columns:

ID_string                   object
indexdate           datetime64[ns]
XR_count                     int64
CT_count                     int64
studyid_concat              object
studydate_concat            object
modality_concat             object

And it looks something like:

    ID_string   indexdate   XR_count    CT_count    studyid_concat      studydate_concat
0   55555555    2020-09-07  10          1           ['St1', 'St5'...]       ['06/22/2019', '09/20/2020'...]
1   66666666    2020-06-07  5           0           ['St11', 'St17'...]     ['05/22/2020', '06/24/2020'...]

Where the 0 element in studyid_concat ("St1") corresponds to the 0 element in studydate_concat, and in modality_concat, etc. I did not show modality_concat for space reasons, but it's something like ['XR', 'CT', ...]

My current goal is to find the closest X-ray study performed before and after my indexdate, as well as being able to rank studies from closest to furthest. I'm somewhat new to pandas, but here is my current attempt:

df = pd.read_excel(path_to_excel, sheet_name='Sheet1')

# Convert comma separated string from Excel to lists of strings
df.studyid_concat = df.studyid_concat.str.split(',')
df.studydate_concat = df.studydate_concat.str.split(',')
df.modality_concat = df.modality_concat.str.split(',')

for x in in df['ID_string'].values:
    index_date = df.loc[df['ID_string'] == x, 'indexdate']

    # Had to use subscript [0] below because result of above was a list in an array
    studyid_list = df.loc[df['ID_string'] == x, 'studyid_concat'].values[0]
    date_list = df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]
    modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]

    xr_date_list = [date_list[x] for x in range(len(date_list)) if modality_list[x]=="XR"]
    xr_studyid_list = [studyid_list[x] for x in range(len(studyid_list)) if modality_list[x]=="XR"]

That's about as far as I got because I'm somewhat confused on datatypes here. My indexdate is currently in dtype: datetime64[ns] <class 'pandas.core.series.Series'> which I was thinking of converting using the datetime module, but was having a hard time figuring out how. I also wasn't sure if I needed to. My xr_study_list is a list of strings containing dates in format 'mm/dd/yyyy'. I think if I could figure out the rest if I could get the data types in the right format. I'd just compare if the dates are >= or < indexdate to sort into before/after, and then subtract each date by indexdate and sort. I think whatever I do with my xr_date_list, I'd just have to be sure to do the same with xr_studyid_list to keep track of the unique study id

Edit: Desired output dataframe would look like

    ID_string   indexdate   StudyIDBefore           StudyDateBefore     
0   55555555    2020-09-07  ['St33', 'St1', ...]    [2020-09-06, 2019-06-22, ...]
1   66666666    2020-06-07  ['St11', 'St2', ...]    [2020-05-22, 2020-05-01, ...]

Where the "before" variables would be sorted from nearest to furthest, and similar "after columns would exist. My current goal is just to check if a study exists within 3 days before and after this indexdate, but having the above dataframe would give me the flexibility if I need to start looking beyond the nearest study.

Upvotes: 1

Views: 215

Answers (1)

anbk
anbk

Reputation: 93

Think I found my own answer after spending some time thinking about it some more and referencing more of pandas to_datetime documentation. Basically realized I could convert my list of string dates using pd.to_datetime

date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values

Then could subtract my index date from this list. Opted to do this within a temporary dataframe so I could keep track of the other column values (like study ID, modality, etc.).

Full code is below:

for x in df['ID_string'].values:
    index_date = df.loc[df['ID_string'] == x, 'indexdate'].values[0]
    date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values
    modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]
    studyid_list = df.loc[df['ID_string'] == x, '_concat'].values[0]

    tempdata = list(zip(studyid_list, date_list, modality_list))
    tempdf = pd.DataFrame(tempdata, columns=['studyid', 'studydate', 'modality'])

    tempdf['indexdate'] = index_date
    tempdf['timedelta'] = tempdf['studydate']-tempdf['index_date']

    tempdf['study_done_wi_3daysbefore'] = np.where((tempdf['timedelta']>=np.timedelta64(-3,'D')) & (tempdf['timedelta']<np.timedelta64(0,'D')), True, False)
    tempdf['study_done_wi_3daysafter'] = np.where((tempdf['timedelta']<=np.timedelta64(3,'D')) & (tempdf['timedelta']>=np.timedelta64(0,'D')), True, False)
    tempdf['study_done_onindex'] = np.where(tempdf['timedelta']==np.timedelta64(0,'D'), True, False)

    XRonindex[x] = True if len(tempdf.loc[(tempdf['study_done_onindex']==True) & (tempdf['modality']=='XR'), 'studyid'])>0 else False
    XRwi3days[x] = True if len(tempdf.loc[(tempdf['study_done_wi_3daysbefore']==True) & (tempdf['modality']=='XR'), 'studyid'])>0 else False
    # can later map these values back to my original dataframe as a new column

Upvotes: 0

Related Questions