NBC

Reputation: 1698

Partition dataset by timestamp

I have a dataframe with millions of rows like the one below. No (ID, Time) pair is duplicated:

ID | Time | Activity 
a  |   1  | Bar 
a  |   3  | Bathroom 
a  |   2  | Bar 
a  |   4  | Bathroom 
a  |   5  | Outside
a  |   6  | Bar 
a  |   7  | Bar

What's the most efficient way to convert it to this format?

ID | StartTime | EndTime | Location 
a  |   1       |    2    |  Bar 
a  |   3       |    4    | Bathroom
a  |   5       |   N/A   | Outside
a  |   6       |   7     | Bar

I have to do this with a lot of data, so I am wondering how to make it as fast as possible.

Upvotes: 0

Views: 52

Answers (1)

BENY

Reputation: 323316

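Using groupby (this works when each ID/Activity pair occurs in a single stretch; see the update for activities that repeat):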

df.groupby(['ID','Activity']).Time.apply(list).apply(pd.Series).rename(columns={0:'starttime',1:'endtime'}).reset_index()
Out[251]: 
  ID  Activity  starttime  endtime
0  a       Bar        1.0      2.0
1  a  Bathroom        3.0      4.0
2  a   Outside        5.0      NaN

Or using pivot_table

df.assign(I=df.groupby(['ID','Activity']).cumcount()).pivot_table(index=['ID','Activity'],columns='I',values='Time')
Out[258]: 
I              0    1
ID Activity          
a  Bar       1.0  2.0
   Bathroom  3.0  4.0
   Outside   5.0  NaN

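Update: if the same activity can recur for an ID, number its rows in pairs with cumcount()//2 so that each pair becomes its own output row: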

df.assign(I=df.groupby(['ID','Activity']).cumcount()//2).groupby(['ID','Activity','I']).Time.apply(list).apply(pd.Series).rename(columns={0:'starttime',1:'endtime'}).reset_index()
Out[282]: 
  ID  Activity  I  starttime  endtime
0  a       Bar  0        1.0      2.0
1  a       Bar  1        6.0      7.0
2  a  Bathroom  0        3.0      4.0
3  a   Outside  0        5.0      NaN
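Note that cumcount()//2 pairs rows two at a time, so it relies on each stay having at most two rows and would mis-pair two separate single-row stays of the same activity. If that assumption does not hold, one possible alternative (not part of the answer above, just a sketch) is to sort by ID and Time and label consecutive runs of the same activity with shift and cumsum. The sample frame is rebuilt from the question; the helper names `run`, `new_run` and `n` are made up for illustration, and a single-row run is assumed to mean "no end time".

```python
import pandas as pd

# Sample data copied from the question (rows deliberately out of time order).
df = pd.DataFrame({
    "ID": ["a"] * 7,
    "Time": [1, 3, 2, 4, 5, 6, 7],
    "Activity": ["Bar", "Bathroom", "Bar", "Bathroom", "Outside", "Bar", "Bar"],
})

# Order rows chronologically within each ID.
df = df.sort_values(["ID", "Time"]).reset_index(drop=True)

# A new run starts whenever the ID or the Activity differs from the previous row.
new_run = (df["ID"] != df["ID"].shift()) | (df["Activity"] != df["Activity"].shift())
df["run"] = new_run.cumsum()

# One output row per run: first and last Time, plus the row count of the run.
out = (
    df.groupby("run")
      .agg(
          ID=("ID", "first"),
          StartTime=("Time", "first"),
          EndTime=("Time", "last"),
          Location=("Activity", "first"),
          n=("Time", "count"),
      )
      .reset_index(drop=True)
)

# Assumption: a run with a single row has no end time, so blank it out (NaN).
out["EndTime"] = out["EndTime"].where(out["n"] > 1)
out = out.drop(columns="n")

print(out)
```

Everything here is vectorised apart from the single groupby, so it should scale reasonably to millions of rows, though that is worth timing on the real data.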

Upvotes: 1
