Reputation: 85
I am struggling to find the solution for a tracking problem in r.
I have a daily dataframe that consists of rows representing that a package is located somewhere. I have a date
, a package_id
and a location_id
in three columns. Location_id is a numeric variable.
It can happen that a number of days pass and the location_id
does not change. However, I want to create a new dataframe where I can track the change in location_id.
Let us suppose that the the following observations happen:
packake_id location_id date_id
PACK001 123 2018-04-02
PACK001 123 2018-04-03
PACK001 436 2018-04-04
The new object I want to produce should include the the package_id, the "old" location_id, the date of the "last" 123 location_id, and the "new" location_id, and the date of the "first" 436 location_id.
In the example it should be like this:
packake_id old_dat old_location_id new_date new_location_id
PACK001 2018-04-03 123 2018-04-04 436
My first thought was to combine all dataframes into one df by package_id and by that I will get a number of columns with the different dates and locations. Then I can filter out the changes in location by not equal. But it seems too complicated for me.
Thanks for your help!
Upvotes: 0
Views: 890
Reputation: 263301
Here's a self-join strategy that uses a shifted seq-indicator as the join criterion.
dta <- read.table(text="packake_id location_id date_id
PACK001 123 2018-04-02
PACK001 123 2018-04-03
PACK001 436 2018-04-04", header=TRUE,
colClasses=c("character", "character", "Date")
dta$dupes <- with(dta, duplicated( paste(packake_id, location_id,sep="_"),
fromLast=TRUE) )
dta2 <- dta[!dts$dupes, ] # removes non-changing pkg-locations
dta2$pack_seq <- as.numeric(ave(dta2$location_id, dta2$packake_id, FUN= seq_along))
dta2$pack_seq_next <- dta2$pack_seq -1 # the next loc now has same as prior loc
Obviously will need to "clean up" this result:
merge( dta2, dta2, by.x='pack_seq', by.y='pack_seq_next')
pack_seq packake_id.x location_id.x date_id.x dupes.x pack_seq_next packake_id.y
1 1 PACK001 123 2018-04-03 FALSE 0 PACK001
location_id.y date_id.y dupes.y pack_seq
1 436 2018-04-04 FALSE 2
Cleaning up, but leaving the renaming to you:
dta_shifts <- merge( dta2, dta2, by.x='pack_seq', by.y='pack_seq_next')
Warning message:
In merge.data.frame(dta2, dta2, by.x = "pack_seq", by.y = "pack_seq_next") :
column name ‘pack_seq’ is duplicated in the result
dta_shifts <- dta_shifts [, c('packake_id.x', 'date_id.x', 'location_id.x', 'date_id.y' , 'location_id.y')]
#--------
> dta_shifts
packake_id.x date_id.x location_id.x date_id.y location_id.y
1 PACK001 2018-04-03 123 2018-04-04 436
If you had put in a somewhat larger example, I would have included the code needed to do this "by packake_id", but I think it shouldn't be to hard to identify where additional merge criteria are needed. The duplication removal step already has that feature.
Here's some data.table code using @Frank's rleid
suggestion using a somewhat bigger data example::
dta <- read.table(text="packake_id location_id date_id
PACK001 123 2018-04-02
PACK001 123 2018-04-03
PACK001 436 2018-04-04
PACK001 123 2018-04-02
PACK001 123 2018-04-03
PACK001 436 2018-04-04", header=TRUE, colClasses=c("character", "character", "Date"))
with(dta, rleid(packake_id,location_id))
[1] 1 1 2 3 3 4
setDT(dta)
dta[ , seq_id_loc := rleid(packake_id,location_id)]
dta[ !duplicated(seq_id_loc, fromLast=TRUE), ]
#------------
packake_id location_id date_id seq_id_loc
1: PACK001 123 2018-04-03 1
2: PACK001 436 2018-04-04 2
3: PACK001 123 2018-04-03 3
4: PACK001 436 2018-04-04 4
merge( dta2, dta2, by.x='seq_id_loc', by.y='pack_seq_next')
seq_id_loc packake_id.x location_id.x date_id.x pack_seq_next packake_id.y location_id.y
1: 1 PACK001 123 2018-04-03 0 PACK001 436
2: 2 PACK001 436 2018-04-04 1 PACK001 123
3: 3 PACK001 123 2018-04-03 2 PACK001 436
date_id.y seq_id_loc
1: 2018-04-04 2
2: 2018-04-03 3
3: 2018-04-04 4
dta3 <- merge( dta2, dta2, by.x='seq_id_loc', by.y='pack_seq_next')
dta3[ , list(packake_id.x, date_id.x, location_id.x, date_id.y , location_id.y)]
#-----------
packake_id.x date_id.x location_id.x date_id.y location_id.y
1: PACK001 2018-04-03 123 2018-04-04 436
2: PACK001 2018-04-04 436 2018-04-03 123
3: PACK001 2018-04-03 123 2018-04-04 436
Upvotes: 1