Reputation: 263
I have a dataframe df
:
id year groupid
1 A1 2000 G1
2 A1 2000 G1
3 A1 2000 G1
4 A2 2000 G2
5 A1 2001 G1
6 A12 2001 G1
7 A13 2001 G1
8 A3 2001 G2
9 A33 2001 G2
10 A4 2001 G3
11 A4 2002 G3
12 A5 2002 G3
13 A5 2003 G2
14 A6 2003 G4
What I would like to do is to use setdiff()
between values with the same groupid
of two consecutive years.
Example :
For the year 2000, G1
has one id
: A1
. For the year 20001, G1
has three different id
: A1
, A12
, A13
. So when setdiff()
is applied between those two, it will return 2
. Between the year 2001 and 2002, since G1
is not present in the year 2002, the value 0
will be given. For the same groupid
, if two years are not consecutive, the value given will be either 0
if the group is not present in the second year or the number of the different id
.
Expected results :
year groupid newid
1 2000 G1 1
2 2000 G2 1
3 2000 G3 0
4 2000 G4 0
5 2001 G1 2
6 2001 G2 2
7 2001 G3 1
8 2001 G4 0
9 2002 G1 0
10 2002 G2 0
11 2002 G3 2
12 2002 G4 0
13 2003 G1 0
14 2003 G2 1
15 2003 G3 0
16 2003 G4 1
I have done this with a for loop
, some if()
and dplyr
functions but since I have a lot of rows, it took a bit too much time (around 5 minutes). So I'm searching to replace the loop by some dplyr
functions or data.table
functions to do this task if less time.
Data :
structure(list(id = c("A1", "A1", "A1", "A2", "A1", "A12", "A13",
"A3", "A33", "A4", "A4", "A5", "A5", "A6"), year = c(2000, 2000,
2000, 2000, 2001, 2001, 2001, 2001, 2001, 2001, 2002, 2002, 2003,
2003), groupid = c("G1", "G1", "G1", "G2", "G1", "G1", "G1",
"G2", "G2", "G3", "G3", "G3", "G2", "G4")), .Names = c("id",
"year", "groupid"), row.names = c(NA, -14L), class = "data.frame")
EDIT : Modified the example
Upvotes: 1
Views: 853
Reputation: 12819
This solution is a combination of my initial one using tidyr
and dplyr
, and of @jogo
's (now deleted) answer using base R aggregate(drop = FALSE)
:
df <- read.table(header = T, stringsAsFactors = F, text =
" id year groupid
1 A1 2000 G1
2 A1 2000 G1
3 A1 2000 G1
4 A2 2000 G2
5 A1 2001 G1
6 A12 2001 G1
7 A13 2001 G1
8 A3 2001 G2
9 A33 2001 G2
10 A4 2001 G3
11 A4 2002 G3
12 A5 2002 G3
13 A5 2003 G2
14 A6 2003 G4")
library(dplyr)
df %>%
aggregate(id ~ year + groupid, ., unique, drop = FALSE) %>%
group_by(groupid) %>%
arrange(year) %>%
mutate(new_ids = mapply(setdiff, id, lag(id)),
newid = lapply(new_ids, length)) %>%
ungroup() %>%
arrange(year, groupid) %>%
as.data.frame()
# year groupid id new_ids newid
# 1 2000 G1 A1 A1 1
# 2 2000 G2 A2 A2 1
# 3 2000 G3 0
# 4 2000 G4 0
# 5 2001 G1 A1, A12, A13 A12, A13 2
# 6 2001 G2 A3, A33 A3, A33 2
# 7 2001 G3 A4 A4 1
# 8 2001 G4 0
# 9 2002 G1 0
# 10 2002 G2 0
# 11 2002 G3 A4, A5 A5 1
# 12 2002 G4 0
# 13 2003 G1 0
# 14 2003 G2 A5 A5 1
# 15 2003 G3 0
# 16 2003 G4 A6 A6 1
Edit: explanations, answer to comment:
The , .,
part is to say df
should be the second argument to aggregate()
, not the first (see help("%>%")
). But really it was to make it shorter, you'd be better off using a longer and more robust form with named arguments, like: aggregate(formula = id ~ year + groupid, data = ., FUN = unique, drop = FALSE)
.
The added group_by()
is necessary to make the use of lag()
consistent. Without groups, we'd take the risk that the ids of 2000, G2
are compared to those of 2003, G1
(consecutive lines after aggregate()
). What we really want is setdiff(..., lag(...
to happen inside a groupid
group, with rows ordered by year
(hence the probably unnecessary arrange(year)
). The lack of this group_by()
makes no difference on this example, but would probably make one on your real life data.
Upvotes: 4