Tilsight
Tilsight

Reputation: 85

How to track a change of a data in r?

I am struggling to find the solution for a tracking problem in r.

I have a daily dataframe that consists of rows representing that a package is located somewhere. I have a date, a package_id and a location_id in three columns. Location_id is a numeric variable.

It can happen that a number of days pass and the location_id does not change. However, I want to create a new dataframe where I can track the change in location_id.

Let us suppose that the the following observations happen:

packake_id location_id date_id
PACK001    123         2018-04-02
PACK001    123         2018-04-03
PACK001    436         2018-04-04

The new object I want to produce should include the the package_id, the "old" location_id, the date of the "last" 123 location_id, and the "new" location_id, and the date of the "first" 436 location_id.

In the example it should be like this:

packake_id old_dat    old_location_id new_date    new_location_id
PACK001    2018-04-03 123             2018-04-04  436

My first thought was to combine all dataframes into one df by package_id and by that I will get a number of columns with the different dates and locations. Then I can filter out the changes in location by not equal. But it seems too complicated for me.

Thanks for your help!

Upvotes: 0

Views: 890

Answers (1)

IRTFM
IRTFM

Reputation: 263301

Here's a self-join strategy that uses a shifted seq-indicator as the join criterion.

dta <- read.table(text="packake_id location_id date_id
PACK001    123         2018-04-02
PACK001    123         2018-04-03
PACK001    436         2018-04-04", header=TRUE, 
                                    colClasses=c("character", "character", "Date")

dta$dupes <- with(dta, duplicated( paste(packake_id, location_id,sep="_"), 
                          fromLast=TRUE) )
dta2 <- dta[!dts$dupes, ] # removes non-changing pkg-locations

dta2$pack_seq <- as.numeric(ave(dta2$location_id, dta2$packake_id, FUN= seq_along))
dta2$pack_seq_next <- dta2$pack_seq  -1  # the next loc now has same as prior loc

Obviously will need to "clean up" this result:

 merge( dta2, dta2, by.x='pack_seq', by.y='pack_seq_next') 
  pack_seq packake_id.x location_id.x  date_id.x dupes.x pack_seq_next packake_id.y
1        1      PACK001           123 2018-04-03   FALSE             0      PACK001
  location_id.y  date_id.y dupes.y pack_seq
1           436 2018-04-04   FALSE        2

Cleaning up, but leaving the renaming to you:

dta_shifts <- merge( dta2, dta2, by.x='pack_seq', by.y='pack_seq_next') 
Warning message:
In merge.data.frame(dta2, dta2, by.x = "pack_seq", by.y = "pack_seq_next") :
  column name ‘pack_seq’ is duplicated in the result

dta_shifts <- dta_shifts [, c('packake_id.x', 'date_id.x', 'location_id.x', 'date_id.y' ,   'location_id.y')]
#--------
> dta_shifts
  packake_id.x  date_id.x location_id.x  date_id.y location_id.y
1      PACK001 2018-04-03           123 2018-04-04           436

If you had put in a somewhat larger example, I would have included the code needed to do this "by packake_id", but I think it shouldn't be to hard to identify where additional merge criteria are needed. The duplication removal step already has that feature.

Here's some data.table code using @Frank's rleid suggestion using a somewhat bigger data example::

dta <- read.table(text="packake_id location_id date_id
PACK001    123         2018-04-02
PACK001    123         2018-04-03
PACK001    436         2018-04-04
PACK001    123         2018-04-02
PACK001    123         2018-04-03
PACK001    436         2018-04-04", header=TRUE, colClasses=c("character", "character", "Date"))
with(dta, rleid(packake_id,location_id))
[1] 1 1 2 3 3 4
setDT(dta)
dta[ , seq_id_loc  := rleid(packake_id,location_id)]
dta[ !duplicated(seq_id_loc, fromLast=TRUE), ]
#------------
   packake_id location_id    date_id seq_id_loc
1:    PACK001         123 2018-04-03          1
2:    PACK001         436 2018-04-04          2
3:    PACK001         123 2018-04-03          3
4:    PACK001         436 2018-04-04          4
merge( dta2, dta2, by.x='seq_id_loc', by.y='pack_seq_next')
   seq_id_loc packake_id.x location_id.x  date_id.x pack_seq_next packake_id.y location_id.y
1:          1      PACK001           123 2018-04-03             0      PACK001           436
2:          2      PACK001           436 2018-04-04             1      PACK001           123
3:          3      PACK001           123 2018-04-03             2      PACK001           436
    date_id.y seq_id_loc
1: 2018-04-04          2
2: 2018-04-03          3
3: 2018-04-04          4

 dta3 <- merge( dta2, dta2, by.x='seq_id_loc', by.y='pack_seq_next')
 dta3[ ,  list(packake_id.x, date_id.x, location_id.x, date_id.y ,   location_id.y)]
#-----------
   packake_id.x  date_id.x location_id.x  date_id.y location_id.y
1:      PACK001 2018-04-03           123 2018-04-04           436
2:      PACK001 2018-04-04           436 2018-04-03           123
3:      PACK001 2018-04-03           123 2018-04-04           436

Upvotes: 1

Related Questions