Reputation: 107
I have the following dataset:
df <- data.frame(c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5), c("a","a","a","b","b","b","b","b","b","b","b",
"a","a","a","b","b","b"),
c(300,295,295,25,25,25,25,25,20,20,20,300,295,295,300, 295,295),
c("c","d","e","f","g","h","i","j","l","m","n","o","p","q","r","s","t"))
colnames(df) <- c("ID", "Group", "Price", "OtherNumber")
> df
ID Group Price OtherNumber
1 1 a 300 c
2 1 a 295 d
3 1 a 295 e
4 2 b 25 f
5 2 b 25 g
6 2 b 25 h
7 2 b 25 i
8 3 b 25 j
9 3 b 20 l
10 3 b 20 m
11 3 b 20 n
12 4 a 300 o
13 4 a 295 p
14 4 a 295 q
15 5 b 300 r
16 5 b 295 s
17 5 b 295 t
I want to compare the first price of consecutive IDs. I want to flag two consecutive IDs only if they have the same initial price and are in the same group. In case this is not clear, here is an example: comparing the first and second ID, both the group (a vs. b) and the initial price (300 vs. 25) are a mismatch. IDs 2 and 3, on the other hand, are both in group b and have the same initial price of 25 (cf. rows 4 and 8). The prices after the first one do not matter, as they may differ.
I figure this should be possible with the dplyr package, and I have put together a very rough attempt (which does not yet work).
# Load dplyr
library(dplyr)
# Assign row numbers within IDs
df1 <- df %>%
group_by(ID) %>%
mutate(subID = row_number())
# Isolate first observation in ID
df2 <- df1[df1$subID == 1,]
# Set up loop to iterate through IDs
for (i in 2:length(df2)) {
if (df2$Price[i] - df2$Price[i - 1] == 0) {
df2$flag <- TRUE
} else {
df2$flag <- FALSE
}
}
If you tell me that this is the only possible solution, I will obviously devote more resources to it, but I am sure there must be an easier way. I searched SO and may have missed something, but I was not able to find anything along these lines. Thanks!
The output I want to get looks something like this:
ID Group Price OtherNumber flag
1 1 a 300 c FALSE
2 1 a 295 d FALSE
3 1 a 295 e FALSE
4 2 b 25 f TRUE
5 2 b 25 g TRUE
6 2 b 25 h TRUE
7 2 b 25 i TRUE
8 3 b 25 j TRUE
9 3 b 20 l TRUE
10 3 b 20 m TRUE
11 3 b 20 n TRUE
12 4 a 300 o FALSE
13 4 a 295 p FALSE
14 4 a 295 q FALSE
15 5 b 300 r FALSE
16 5 b 295 s FALSE
17 5 b 295 t FALSE
Upvotes: 0
Views: 487
Reputation: 27732
Here is a data.table oneliner... cut into smaller pieces to view intermediate results; also see explanation at the bottom of the answer.
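library( data.table )
dt <- as.data.table( df )
# take the first row per ID, flag it when the next or previous first row matches on Group and Price,
# then join the flag back onto all rows of dt by ID
dt[ dt[ , .SD[1], ID][ ( Group == shift( Group, type = "lead") & Price == shift( Price, type = "lead") ) |
( Group == shift( Group, type = "lag") & Price == shift( Price, type = "lag") ),
flag := TRUE][is.na(flag), flag := FALSE], flag := i.flag, on = .(ID)][]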
# ID Group Price OtherNumber flag
# 1: 1 a 300 c FALSE
# 2: 1 a 295 d FALSE
# 3: 1 a 295 e FALSE
# 4: 2 b 25 f TRUE
# 5: 2 b 25 g TRUE
# 6: 2 b 25 h TRUE
# 7: 2 b 25 i TRUE
# 8: 3 b 25 j TRUE
# 9: 3 b 20 l TRUE
# 10: 3 b 20 m TRUE
# 11: 3 b 20 n TRUE
# 12: 4 a 300 o FALSE
# 13: 4 a 295 p FALSE
# 14: 4 a 295 q FALSE
# 15: 5 b 300 r FALSE
# 16: 5 b 295 s FALSE
# 17: 5 b 295 t FALSE
explanation:
dt[ , .SD[1], ID]
creates a data.table with the first row of each ID.
[ Group == shift( ... , flag := TRUE]
sets the column flag to TRUE when the next (or previous) row has a matching Price and Group.
[is.na(flag), flag := FALSE]
fills in the rest (everything that is not TRUE) with FALSE.
..., flag := i.flag, on = .(ID)]
performs a left join (by reference, so it's fast and efficient) on the original data.table, to get the final result.
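Since the one-liner is hard to read, here is a rough sketch of the same steps split apart, so each intermediate result can be inspected. The name first_rows is my own choice (it does not appear in the code above), and the sample data is rebuilt from the question with named columns so that the block runs on its own. It assumes the data.table package is installed.

```r
library(data.table)

# Sample data from the question, rebuilt with named columns
df <- data.frame(
  ID          = c(1,1,1,2,2,2,2,3,3,3,3,4,4,4,5,5,5),
  Group       = c("a","a","a","b","b","b","b","b","b","b","b","a","a","a","b","b","b"),
  Price       = c(300,295,295,25,25,25,25,25,20,20,20,300,295,295,300,295,295),
  OtherNumber = c("c","d","e","f","g","h","i","j","l","m","n","o","p","q","r","s","t")
)
dt <- as.data.table(df)

# Step 1: keep only the first row of each ID
# (first_rows is a hypothetical helper name)
first_rows <- dt[, .SD[1], by = ID]

# Step 2: set flag to TRUE where the next or the previous first row
# has the same Group and the same Price; NA comparisons at the
# edges count as "no match" when used in i
first_rows[
  (Group == shift(Group, type = "lead") & Price == shift(Price, type = "lead")) |
  (Group == shift(Group, type = "lag")  & Price == shift(Price, type = "lag")),
  flag := TRUE
]

# Step 3: rows that were not flagged still hold NA, so set them to FALSE
first_rows[is.na(flag), flag := FALSE]

# Step 4: copy the per-ID flag onto every row of the original table,
# matching on ID and assigning by reference
dt[first_rows, flag := i.flag, on = .(ID)]

print(first_rows)
print(dt)
```

With the sample data, first_rows should show flag as FALSE, TRUE, TRUE, FALSE, FALSE for IDs 1 to 5, and dt should match the output shown above.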
Upvotes: 1