Reputation: 383
Is there a way to compare whether "any value of" a row is identical to "any value" of the row above -- regardless of the order? Below is a very random input data table.
DT <- data.table(A=c("a","a","b","d","e","f","h","i","j"),
B=c("a","b","c","c","f","g",NA,"j",NA),
C=c("a","b","c","b","g","h",NA,NA,NA))
> DT
A B C
1: a a a
2: a b b
3: b c c
4: d c b
5: e f g
6: f g h
7: h NA NA
8: i j NA
9: j NA NA
I would like to add a column D that compares a row with the row above, and compare whether any values of the two rows are identical (regardless of the order). So the desired output would be:
> DT
A B C D
1: a a a 0 #No row above to compare; could be either NA or 0
2: a b b 1 #row 2 has "a", which is in row 1; returns 1
3: b c c 1 #row 3 has "b", which is in row 2; returns 1
4: d c b 1 #row 4 has "b" and "c", which are in row 3; returns 1
5: e f g 0 #row 5 has nothing that is in row 4; returns 0
6: f g h 1 #row 6 has "f" and "g", which are in row 5; returns 1
7: h NA NA 1 #row 7 has "h", which is in row 6; returns 1
8: i j NA 0 #row 8 has nothing that is in row 7 (NA doesn't count)
9: j NA NA 1 #row 9 has "j", which is in row 8; returns 1 (NA doesn't count)
The main idea is that I would like to compare a row (or a vector) with another row (vector), and define two rows to be identical if any of the elements in each row (vector) are. (without reiterating to compare each element)
Upvotes: 4
Views: 640
Reputation: 38500
Here is another method. It's probably not advisable on large data.tables as it uses by=1:nrow(DT)
which tends to be quite slow.
DT[, D:= sign(DT[, c(.SD, shift(.SD))][,
sum(!is.na(intersect(unlist(.SD[, .(A, B, C)]), unlist(.SD[, .(V4, V5, V6)])))),
by=1:nrow(DT)]$V1)]
Here, [, c(.SD, shift(.SD))]
creates a copy of the data.frame, with the lagged variables included (cbinded). Then the second chain intersects the unlisted variables in the original data.table and the shifted data.table. NAs are assigned 0 and non-NAs are assigned 1 and these results are summed. This operation occurs for each row of the copied data.table. The sum is extracted with $v1
and is turned into binary (0 and 1) using sign
.
It returns
DT
A B C D
1: a a a 0
2: a b b 1
3: b c c 1
4: d c b 1
5: e f g 0
6: f g h 1
7: h NA NA 1
8: i j NA 0
9: j NA NA 1
Upvotes: 3
Reputation: 12937
Here is a base
R solution using intersect
:
res <- c(0, sapply(2:nrow(DT), function(i)
length(intersect( na.omit(as.character(DT[i,])), na.omit(as.character(DT[i-1,])) ) )>0))
cbind(DT, D=res)
# A B C D
# 1: a a a 0
# 2: a b b 1
# 3: b c c 1
# 4: d c b 1
# 5: e f g 0
# 6: f g h 1
# 7: h NA NA 1
# 8: i j NA 0
# 9: j NA NA 1
Upvotes: 2
Reputation: 70256
Here's a loop-free approach using data.table's joins:
DT[, id := 1:.N]
dt <- melt(DT, id.vars = "id")
dt[, id2 := id-1]
dt <- dt[!is.na(value)]
idx <- dt[dt, on = .(id2 = id, value), nomatch=0][, unique(id)]
DT[, `:=`(D = as.integer(id %in% idx), id = NULL)]
It looks somewhat complicated but id does perform pretty well with just over a second for a 1-million-row data set with three columns.
Upvotes: 2
Reputation: 3597
Using a combination of intersect
and mapply
you could do:
#list of unique elements in each row
tableList = apply(DT,1,function(x) unique(na.omit(x)))
#a lagged list to be compared with above list
tableListLag = c(NA,tableList[2:length(tableList)-1])
#find common elements using intersect function
#if length > 0 implies common elements hence set value as 1 else 0
DT$D = mapply(function(x,y) ifelse(length(intersect(x,y))>0,1,0) ,tableList,tableListLag,
SIMPLIFY = TRUE)
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Upvotes: 1
Reputation: 12559
This solution compares the two rows with %in%
(after unlist()
):
DT[, result:=as.integer(c(NA, sapply(2:DT[,.N], function(i) any(na.omit(unlist(DT[i])) %in% unlist(DT[i-1])))))]
#> DT
# A B C result
#1: a a a NA
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Upvotes: 1
Reputation: 886938
We can do this by getting the lead
rows of the dataset, paste
each row, check for any pattern in with the paste
ed rows of original dataset using grepl
and Map
, then unlist
and convert to integer
DT[, D := {
v1 <- do.call(paste, .SD)
v2 <- do.call(paste, c(shift(.SD, type = "lead"), sep="|"))
v2N <- gsub("NA\\|*|\\|*NA", "", v2)
v3 <- unlist(Map(grepl, v2N, v1), use.names = FALSE)
as.integer(head(c(FALSE, v3), -1))
}]
DT
# A B C D
#1: a a a 0
#2: a b b 1
#3: b c c 1
#4: d c b 1
#5: e f g 0
#6: f g h 1
#7: h NA NA 1
#8: i j NA 0
#9: j NA NA 1
Or we can do a split
and do comparison using Map
as.integer(c(FALSE, unlist(Map(function(x,y) {
x1 <- na.omit(unlist(x))
y1 <- na.omit(unlist(y))
any(x1 %in% y1 | y1 %in% x1) },
split(DT[-nrow(DT)], 1:(nrow(DT)-1)), split(DT[-1], 2:nrow(DT))), use.names = FALSE)))
Upvotes: 4
Reputation: 2274
I would do a sapply along the indices (minus the last) of the table:
compare <- function(i) {
row1 <- as.character(DT[i,])
row2 <- as.character(DT[i+1,])
return(length(intersect(row1[!is.na(row1)], row2[!is.na(row2)])) > 0)
}
result <- sapply(1:(nrow(DT) - 1), compare)
This returns a vector of logicals, so if you prefer to get integers, wrap the output of compare
in a as.numeric()
Upvotes: 2