hulio_entredas
hulio_entredas

Reputation: 671

Collapse two rows in stata into one that share values in different columns

I have whittled down a Stata dataset into something like this example of two rows:

id   location     start_datetime         end_datetime
1    64^6418    25feb2009 15:45:54    25feb2009 15:58:54
1    64^6418    25feb2009 15:58:54    09mary2009 15:16:06

As you can see the end_datetime of one row for the same id and location is the exact same as the start_datetime of the next row. I'd like to collapse rows that fit this criteria so that they are just one row, with the start_datetime of the first row and the end_datetime of the second row. For example, I would like rows that look like the above to be transformed to this

id   location     start_datetime         end_datetime
1    64^6418    25feb2009 15:45:54    09mary2009 15:16:06

I have accomplished this change with the below code, but it is sort of hack-y and I wonder if there is a better way to do this in Stata:

// Sort by id and start_datetime
sort id location start_datetime

//Check if next start_datetime is the same as current end_datetime
by id (location): gen same_as_next = start_datetime[_n+1] == end_datetime & location[_n+1] == location

//Create flag variable that switches to 1 whenever there is a change in the location or a non-contiguous stays
gen change_flag = same_as_next == 0 if same_as_next != .

//Fill in missing values with 0
replace change_flag = 0 if change_flag == .

* Revisit this step to debug
//Generate groups of contiguous stays
by id (location): gen group = 1 + sum(change_flag)

//Group values are skewed by 1 row so create temp var to store next group value
by id (location): gen temp_group = group[_n-1]

//Force row 1 to be group 1 
replace temp_group = 1 if _n == 1

replace temp_group = 1 if temp_group == .

collapse (min) start_datetime (max) end_datetime, by(id location temp_group)

drop temp_group

Upvotes: 0

Views: 320

Answers (1)

hulio_entredas
hulio_entredas

Reputation: 671

This works as a sort of simpler approach:

// Generate new indicator variable if end_datetime of current row is the same as the start_datetime of the next row
by id (location): gen same_as_next = start_datetime[_n+1] == end_datetime & location[_n+1] == location & id[_n+1] == id

// Generate a new indicator variable if start_datetime of current row is the same as end_datetime of previous row
by id (location): gen same_as_prev = end_datetime[_n-1] == start_datetime & location[_n-1] == location & id[_n-1] == id

// Generate indicator variable for id and location that have both same_as_next and same_as_prev 1
by id(location): gen same_same = same_as_next  == 1 | same_as_prev == 1

// Collapse rows with same_same == 1 keeping the minimum start_datetime and maximum end_datetime
collapse (min) start_datetime (max) end_datetime, by(id location same_same)

// Drop same_same column
drop same_same

Upvotes: 0

Related Questions