Reputation: 123
I'm working with some data where I have to get the date of occurrence. For example, say we're working with medical data. Each row is a unique visit from a patient, though the same patient can have multiple rows. Each row also contains info on the type of visit, whether it was routine or emergency room.
I need to go through, and for each row, get the date that the patient was previously admitted to the emergency room, prior to that visit. For example, I'd like to add a column previous_er_discharge_date as below:
visit_id patient_id discharge_date visit_type previous_er_discharge_date
1 abc 2014-05-05 in-patient 2014-05-01
2 abc 2014-05-01 emergency NaT
3 def 2014-04-18 in-patient NaT
4 def 2014-03-12 in-patient 2014-02-12
5 def 2014-02-12 emergency NaT
So I have something working, but it's very slow. I basically just create a separate data frame of only ER visits, and iterate through the main data frame, and finding whether previous ER dates exist for that patient, and if they do I take the first one. (The data is sorted by discharge_date). A general representation of the code I have.
er_visits = main_data[main_data.visit_type=='emergency']
prev_dates = []
for index, row in main_data.iterrows():
dates = er_visits.discharge_date[(er_visits.patient_id==row.patient_id) &
(er_visits.discharge_date < row.discharge_date)].values
if len(dates) > 0:
prev_dates.append(dates[0])
else:
prev_dates.append(pd.NaT)
The above code works, but it's slow, and I was hoping to get help in finding faster ways to do this. The data set I'm working with has several hundred thousand rows, and this is something that has to run everyday.
Thanks!
Upvotes: 6
Views: 7358
Reputation: 76297
In pandas, you basically want to avoid loops, as they kill performance.
Her's a DataFrame similar to yours (I was lazy about the dates, so they're ints; it's the same idea).
df = pd.DataFrame({
'id': ['abc', 'abc', 'def', 'def', 'def'],
'date': [505, 501, 418, 312, 212]})
And here's a function that, for each group, appends the previous date:
def prev_dates(g):
g.sort(columns=['date'])
g['prev'] = g.date.shift(-1)
return g
So all that's needed is to connect things:
>> df.groupby(df.id).apply(prev_dates)
date id prev
0 505 abc 501
1 501 abc NaN
2 418 def 312
3 312 def 212
4 212 def NaN
Edit
As noted by @julius below, sort(columns=
has since been deprecated, and should be replaced by ``sort_values(by=''.
Upvotes: 16
Reputation: 5355
What if you need to find all visits for that patient?
sort[Date, ID]
[nextpatient] = [ID].shift(-1)
[nextvisit] = np.where([ID] == [nextpatient], 1, 0)
[nextdate] = np.where([nextvisit] == 1, [Date].shift(-1), datetime64.nat)
That's my approach (typed on my phone so it's not exact) . I sort and then shift a unique I'd. If that ID matches the ID, then I shift up date. Then I create a column to measure the time between interactions. Also another column to determine what the reason of the visit was, also just another shift.
I wonder if this is a good approach too in terms of speed. I run it about weekly on a 5 million row data set.
Upvotes: 0