wyatt
wyatt

Reputation: 383

How to compare if any of the elements in a row is same

Is there a way to compare whether "any value of" a row is identical to "any value" of the row above -- regardless of the order? Below is a very random input data table.

DT <- data.table(A=c("a","a","b","d","e","f","h","i","j"),
                 B=c("a","b","c","c","f","g",NA,"j",NA),
                 C=c("a","b","c","b","g","h",NA,NA,NA))

> DT
   A  B  C
1: a  a  a
2: a  b  b
3: b  c  c
4: d  c  b
5: e  f  g
6: f  g  h
7: h NA NA
8: i  j NA
9: j NA NA

I would like to add a column D that compares a row with the row above, and compare whether any values of the two rows are identical (regardless of the order). So the desired output would be:

 > DT
   A  B  C  D
1: a  a  a  0 #No row above to compare; could be either NA or 0
2: a  b  b  1 #row 2 has "a", which is in row 1; returns 1
3: b  c  c  1 #row 3 has "b", which is in row 2; returns 1
4: d  c  b  1 #row 4 has "b" and "c", which are in row 3; returns 1
5: e  f  g  0 #row 5 has nothing that is in row 4; returns 0
6: f  g  h  1 #row 6 has "f" and "g", which are in row 5; returns 1
7: h NA NA  1 #row 7 has "h", which is in row 6; returns 1
8: i  j NA  0 #row 8 has nothing that is in row 7 (NA doesn't count)
9: j NA NA  1 #row 9 has "j", which is in row 8; returns 1 (NA doesn't count)

The main idea is that I would like to compare a row (or a vector) with another row (vector), and define two rows to be identical if any of the elements in each row (vector) are. (without reiterating to compare each element)

Upvotes: 4

Views: 640

Answers (7)

lmo
lmo

Reputation: 38500

Here is another method. It's probably not advisable on large data.tables as it uses by=1:nrow(DT) which tends to be quite slow.

DT[, D:= sign(DT[, c(.SD, shift(.SD))][,
   sum(!is.na(intersect(unlist(.SD[, .(A, B, C)]), unlist(.SD[, .(V4, V5, V6)])))),
   by=1:nrow(DT)]$V1)]

Here, [, c(.SD, shift(.SD))] creates a copy of the data.frame, with the lagged variables included (cbinded). Then the second chain intersects the unlisted variables in the original data.table and the shifted data.table. NAs are assigned 0 and non-NAs are assigned 1 and these results are summed. This operation occurs for each row of the copied data.table. The sum is extracted with $v1 and is turned into binary (0 and 1) using sign.

It returns

DT
   A  B  C D
1: a  a  a 0
2: a  b  b 1
3: b  c  c 1
4: d  c  b 1
5: e  f  g 0
6: f  g  h 1
7: h NA NA 1
8: i  j NA 0
9: j NA NA 1

Upvotes: 3

989
989

Reputation: 12937

Here is a base R solution using intersect:

res <- c(0, sapply(2:nrow(DT), function(i) 
  length(intersect( na.omit(as.character(DT[i,])), na.omit(as.character(DT[i-1,])) ) )>0))

cbind(DT, D=res)
   # A  B  C D
# 1: a  a  a 0
# 2: a  b  b 1
# 3: b  c  c 1
# 4: d  c  b 1
# 5: e  f  g 0
# 6: f  g  h 1
# 7: h NA NA 1
# 8: i  j NA 0
# 9: j NA NA 1

Upvotes: 2

talat
talat

Reputation: 70256

Here's a loop-free approach using data.table's joins:

DT[, id := 1:.N]
dt <- melt(DT, id.vars = "id")
dt[, id2 := id-1]
dt <- dt[!is.na(value)]
idx <- dt[dt, on = .(id2 = id, value), nomatch=0][, unique(id)]
DT[, `:=`(D = as.integer(id %in% idx), id = NULL)]

It looks somewhat complicated but id does perform pretty well with just over a second for a 1-million-row data set with three columns.

Upvotes: 2

Silence Dogood
Silence Dogood

Reputation: 3597

Using a combination of intersect and mapply you could do:

#list of unique elements in each row
tableList = apply(DT,1,function(x) unique(na.omit(x)))

#a lagged list to be compared with above list
tableListLag = c(NA,tableList[2:length(tableList)-1])

#find common elements using intersect function
#if length > 0 implies common elements hence set value as 1 else 0
DT$D = mapply(function(x,y) ifelse(length(intersect(x,y))>0,1,0) ,tableList,tableListLag,
             SIMPLIFY = TRUE)


DT
#   A  B  C D
#1: a  a  a 0
#2: a  b  b 1
#3: b  c  c 1
#4: d  c  b 1
#5: e  f  g 0
#6: f  g  h 1
#7: h NA NA 1
#8: i  j NA 0
#9: j NA NA 1

Upvotes: 1

jogo
jogo

Reputation: 12559

This solution compares the two rows with %in% (after unlist()):

DT[, result:=as.integer(c(NA, sapply(2:DT[,.N], function(i) any(na.omit(unlist(DT[i])) %in% unlist(DT[i-1])))))]
#> DT
#   A  B  C result
#1: a  a  a     NA
#2: a  b  b      1
#3: b  c  c      1
#4: d  c  b      1
#5: e  f  g      0
#6: f  g  h      1
#7: h NA NA      1
#8: i  j NA      0
#9: j NA NA      1

Upvotes: 1

akrun
akrun

Reputation: 886938

We can do this by getting the lead rows of the dataset, paste each row, check for any pattern in with the pasteed rows of original dataset using grepl and Map, then unlist and convert to integer

DT[, D := {
     v1 <- do.call(paste, .SD)
     v2 <- do.call(paste, c(shift(.SD, type = "lead"), sep="|"))
     v2N <- gsub("NA\\|*|\\|*NA", "", v2)
     v3 <- unlist(Map(grepl, v2N, v1), use.names = FALSE)
     as.integer(head(c(FALSE, v3), -1))        
}]

DT
#   A  B  C D
#1: a  a  a 0
#2: a  b  b 1
#3: b  c  c 1
#4: d  c  b 1
#5: e  f  g 0
#6: f  g  h 1
#7: h NA NA 1
#8: i  j NA 0
#9: j NA NA 1

Or we can do a split and do comparison using Map

as.integer(c(FALSE, unlist(Map(function(x,y) {
     x1 <- na.omit(unlist(x))
     y1 <- na.omit(unlist(y))
    any(x1 %in% y1 | y1 %in% x1)  },
     split(DT[-nrow(DT)], 1:(nrow(DT)-1)), split(DT[-1], 2:nrow(DT))), use.names = FALSE)))

Upvotes: 4

PlasmaBinturong
PlasmaBinturong

Reputation: 2274

I would do a sapply along the indices (minus the last) of the table:

compare <- function(i) {
    row1 <- as.character(DT[i,])
    row2 <- as.character(DT[i+1,])
    return(length(intersect(row1[!is.na(row1)], row2[!is.na(row2)])) > 0)
}

result <- sapply(1:(nrow(DT) - 1), compare)

This returns a vector of logicals, so if you prefer to get integers, wrap the output of compare in a as.numeric()

Upvotes: 2

Related Questions