Reputation: 93
tldr; I have an index_date
in dtype: datetime64[ns] <class 'pandas.core.series.Series'>
and a list_of_dates
of type <class 'list'>
with individual elements in str
format. What's the best way to convert these to the same data type so I can sort the dates into closest before and closest after index_date
?
I have a pandas dataframe (df) with columns:
ID_string object
indexdate datetime64[ns]
XR_count int64
CT_count int64
studyid_concat object
studydate_concat object
modality_concat object
And it looks something like:
ID_string indexdate XR_count CT_count studyid_concat studydate_concat
0 55555555 2020-09-07 10 1 ['St1', 'St5'...] ['06/22/2019', '09/20/2020'...]
1 66666666 2020-06-07 5 0 ['St11', 'St17'...] ['05/22/2020', '06/24/2020'...]
Where the 0 element in studyid_concat ("St1") corresponds to the 0 element in studydate_concat, and in modality_concat, etc. I did not show modality_concat for space reasons, but it's something like ['XR', 'CT', ...]
My current goal is to find the closest X-ray study performed before and after my indexdate, as well as being able to rank studies from closest to furthest. I'm somewhat new to pandas, but here is my current attempt:
df = pd.read_excel(path_to_excel, sheet_name='Sheet1')
# Convert comma separated string from Excel to lists of strings
df.studyid_concat = df.studyid_concat.str.split(',')
df.studydate_concat = df.studydate_concat.str.split(',')
df.modality_concat = df.modality_concat.str.split(',')
for x in in df['ID_string'].values:
index_date = df.loc[df['ID_string'] == x, 'indexdate']
# Had to use subscript [0] below because result of above was a list in an array
studyid_list = df.loc[df['ID_string'] == x, 'studyid_concat'].values[0]
date_list = df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]
modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]
xr_date_list = [date_list[x] for x in range(len(date_list)) if modality_list[x]=="XR"]
xr_studyid_list = [studyid_list[x] for x in range(len(studyid_list)) if modality_list[x]=="XR"]
That's about as far as I got because I'm somewhat confused on datatypes here. My indexdate is currently in dtype: datetime64[ns] <class 'pandas.core.series.Series'>
which I was thinking of converting using the datetime
module, but was having a hard time figuring out how. I also wasn't sure if I needed to. My xr_study_list
is a list of strings containing dates in format 'mm/dd/yyyy'. I think if I could figure out the rest if I could get the data types in the right format. I'd just compare if the dates are >= or < indexdate
to sort into before/after, and then subtract each date by indexdate
and sort. I think whatever I do with my xr_date_list
, I'd just have to be sure to do the same with xr_studyid_list
to keep track of the unique study id
Edit: Desired output dataframe would look like
ID_string indexdate StudyIDBefore StudyDateBefore
0 55555555 2020-09-07 ['St33', 'St1', ...] [2020-09-06, 2019-06-22, ...]
1 66666666 2020-06-07 ['St11', 'St2', ...] [2020-05-22, 2020-05-01, ...]
Where the "before" variables would be sorted from nearest to furthest, and similar "after columns would exist. My current goal is just to check if a study exists within 3 days before and after this indexdate, but having the above dataframe would give me the flexibility if I need to start looking beyond the nearest study.
Upvotes: 1
Views: 215
Reputation: 93
Think I found my own answer after spending some time thinking about it some more and referencing more of pandas to_datetime documentation. Basically realized I could convert my list of string dates using pd.to_datetime
date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values
Then could subtract my index date from this list. Opted to do this within a temporary dataframe so I could keep track of the other column values (like study ID, modality, etc.).
Full code is below:
for x in df['ID_string'].values:
index_date = df.loc[df['ID_string'] == x, 'indexdate'].values[0]
date_list = pd.to_datetime(df.loc[df['ID_string'] == x, 'studydate_concat'].values[0]).values
modality_list = df.loc[df['ID_string'] == x, 'modality_concat'].values[0]
studyid_list = df.loc[df['ID_string'] == x, '_concat'].values[0]
tempdata = list(zip(studyid_list, date_list, modality_list))
tempdf = pd.DataFrame(tempdata, columns=['studyid', 'studydate', 'modality'])
tempdf['indexdate'] = index_date
tempdf['timedelta'] = tempdf['studydate']-tempdf['index_date']
tempdf['study_done_wi_3daysbefore'] = np.where((tempdf['timedelta']>=np.timedelta64(-3,'D')) & (tempdf['timedelta']<np.timedelta64(0,'D')), True, False)
tempdf['study_done_wi_3daysafter'] = np.where((tempdf['timedelta']<=np.timedelta64(3,'D')) & (tempdf['timedelta']>=np.timedelta64(0,'D')), True, False)
tempdf['study_done_onindex'] = np.where(tempdf['timedelta']==np.timedelta64(0,'D'), True, False)
XRonindex[x] = True if len(tempdf.loc[(tempdf['study_done_onindex']==True) & (tempdf['modality']=='XR'), 'studyid'])>0 else False
XRwi3days[x] = True if len(tempdf.loc[(tempdf['study_done_wi_3daysbefore']==True) & (tempdf['modality']=='XR'), 'studyid'])>0 else False
# can later map these values back to my original dataframe as a new column
Upvotes: 0