Reputation: 7830
I have a dataframe with daily observations for individuals. In the example I reduce the dataset to 3 variables, the day and two IDs which identify an individual. An idividual is identified by two IDs but someday one of the IDs could be missing (and re-appear another day), but this is still the same idividual.
For example this is observations for the same idividual :
Day id1 id2
1 Day1 1 x
2 Day2 NA x
3 Day3 1 <NA>
Also, one can find the same individuals every day or they can appear/disappear from one day to the other. (For example the id1 = 4 & id2 = d bellow)
I would like to keep the rows where the couple id1, id2 change at least 1 time over days.
I don't know is this is clear enough so here is dummy datas :
datas <- data.frame(id1 = c(1, 2, 3, 1, 2, 3, 4, NA, 2, 3, 4),
id2 = c(NA, "b", "c", "a", "b", "c", "d", "a", NA, "c", "d"),
date = rep(seq(Sys.Date() - 2, Sys.Date(), by = "day"), each = 4)[-4])
id1 id2 date
1 1 <NA> 2015-08-02
2 2 b 2015-08-02
3 3 c 2015-08-02
4 1 a 2015-08-03
5 2 b 2015-08-03
6 3 c 2015-08-03
7 4 d 2015-08-03
8 NA a 2015-08-04
9 2 <NA> 2015-08-04
10 3 c 2015-08-04
11 4 d 2015-08-04
And the output I'm expecting :
id1 id2 date
1 1 <NA> 2015-08-02
2 2 b 2015-08-02
3 1 a 2015-08-03
4 2 b 2015-08-03
5 NA a 2015-08-04
6 2 <NA> 2015-08-04
I thought I got something with this piece of code :
datas.dt <- data.table(datas, key = c("id1", "id2"))
datas.dt[datas.dt[ , .N, by = c("id1", "id2")][N != length(unique(datas.dt$date))]]
id1 id2 date N
1: NA a 2015-08-04 1
2: 1 NA 2015-08-02 1
3: 1 a 2015-08-03 1
4: 2 NA 2015-08-04 1
5: 2 b 2015-08-02 2
6: 2 b 2015-08-03 2
7: 4 d 2015-08-03 2
8: 4 d 2015-08-04 2
#Or with dplyr
datas.tbl <- tbl_df(datas)
datas.tbl %>%
count(id1, id2) %>%
filter(n != length(unique(datas.tbl$date))) %>%
inner_join(datas.tbl, by = c("id1", "id2"))
But it fails if an individual have no observations for all the days in the dataset (the case id2 = d for example). And for now I have no idea how to achieve this. Any help would be highly appreciated. Thanks !
Upvotes: 0
Views: 85
Reputation: 66819
First, clean up the ids by creating a new one:
idmap = unique( na.omit(datas)[,.(id1,id2)] )[,id0 := .I]
datas[idmap, id0 := i.id0, on="id1"]
datas[idmap, id0 := i.id0, on="id2"]
Then select the groups:
datas[
sort( datas[, .I[ uniqueN(.SD) > 1 ], by=id0, .SDcols=c("id1","id2")]$V1 )
]
This approach is borrowed from @eddi.
The row-number object .I
is key here. The column it generates is named V1
by default. After we have these row numbers, we can subset datas
in the usual way, datas[row_numbers]
.
Upvotes: 4
Reputation: 2589
Is this what you want?
filter_id1 <- setDT(datas)[ , length(unique(id2)), by = id1][ V1 > 1, id1]
filter_id2 <- datas[ , length(unique(id1)), by = id2][ V1 > 1, id2]
datas[ id1 %in% filter_id1 | id2 %in% filter_id2]
output:
id1 id2 date
1: 1 NA 2015-08-02
2: 2 b 2015-08-02
3: 1 a 2015-08-03
4: 2 b 2015-08-03
5: NA a 2015-08-04
6: 2 NA 2015-08-04
Upvotes: 4