Reputation: 110

comparing each row with all other rows in data.frame

I have a tibble, each row corresponds a sample with ID, each sample will have multiple category with values. I want to list the pairs of IDs which the value of two or more category matches to one another.

# A tibble: 3 x 2
ID    data            
<chr> <list>          
1 ID1   <tibble [1 x 3]>
2 ID2   <tibble [1 x 3]>
3 ID3   <tibble [1 x 3]>

I can use sum(match(x$data[[i]], x$data[[j]], nomatch=0)>0) to get the matching counts between two rows.

I can get the answer by putting the match() inside a i and j loops. Is there any better, i.e. tidy way (using tidyverse) to get the answer?

Thanks.

For example, here is the original data.frame:

   ID category value
1 ID1   length   100
2 ID1     type     L
3 ID1    color  Blue
4 ID2   length   100
5 ID2     type     M
6 ID2    color  Blue
7 ID3   length   150
8 ID3     type     M
9 ID3    color  Blue

The output will look like:

  ID.a ID.b matches
1  ID1  ID2       2
2  ID1  ID3       1
3  ID2  ID3       2

I used tidyverse to transform the input data.frame to tibble as shown in the beginning, then using match in a loop.

# A tibble: 3 x 2
ID    data            
<chr> <list>          
1 ID1   <tibble [1 x 3]>
2 ID2   <tibble [1 x 3]>
3 ID3   <tibble [1 x 3]>

Upvotes: 3

Answers (4)

Uwe

Reputation: 42544

For the sake of completeness, here is also a solution which uses a self-join:

library(data.table)
setDT(x)[x, on = .(category, value), allow = TRUE][
  ID < i.ID, .N, by = .(ID1 = ID, ID2 = i.ID)]

   ID1 ID2 N
1: ID1 ID2 2
2: ID2 ID3 2
3: ID1 ID3 1

Data

x <- readr::read_table(
"i  ID category value
1 ID1   length   100
2 ID1     type     L
3 ID1    color  Blue
4 ID2   length   100
5 ID2     type     M
6 ID2    color  Blue
7 ID3   length   150
8 ID3     type     M
9 ID3    color  Blue")[, -1L]

Upvotes: 1

akrun

Reputation: 887078

Here is an option using base R by making use of table and crossprod. Set the lower triangular values of the matrix output of crossprod to NA, convert it to 'long' format by converting to data.frame and then subset the rows that are non-NA for 'Freq' column

out <- with(df, crossprod(table(paste(category, value), ID)))
out[lower.tri(out, diag = TRUE)] <- NA
subset(as.data.frame.table(out), !is.na(Freq))
#    ID ID.1 Freq
#4 ID1  ID2    2
#7 ID1  ID3    1
#8 ID2  ID3    2

data

df <- structure(list(ID = c("ID1", "ID1", "ID1", "ID2", "ID2", "ID2", 
"ID3", "ID3", "ID3"), category = c("length", "type", "color", 
 "length", "type", "color", "length", "type", "color"), 
 value = c("100", 
 "L", "Blue", "100", "M", "Blue", "150", "M", "Blue")), 
 class = "data.frame", row.names = c(NA, -9L))

Upvotes: 2

AntoniosK

Reputation: 16121

df = read.table(text="
ID category value
ID1   length   100
ID1     type     L
ID1    color  Blue
ID2   length   100
ID2     type     M
ID2    color  Blue
ID3   length   150
ID3     type     M
ID3    color  Blue
", header=T, stringsAsFactors = F)

library(tidyverse)

# create a new column that combines category and value
df = df %>% unite(cat_val, category, value, remove = F)

# create vectorised function that counts matches (on that new value)
f = function(x,y) sum(df$cat_val[df$ID == x] == df$cat_val[df$ID == y])
f = Vectorize(f)

data.frame(t(combn(unique(df$ID), 2))) %>%  # create combinations of IDs (as a dataframe)
  mutate(matches = f(X1, X2))               # apply function

#    X1  X2 matches
# 1 ID1 ID2       2
# 2 ID1 ID3       1
# 3 ID2 ID3       2

Upvotes: 1

acylam

Reputation: 18681

Not sure if this is tidier, but we can do something like the following:

library(tidyverse)

combn_join <- function(x) {
  map2(combn(1:3, 2)[1,], combn(1:3, 2)[2,],
       ~ left_join(x[[.x]], x[[.y]], by = c("category", "value")) %>% 
         select(ID.x, ID.y))
}  

df %>%
  split(.$ID) %>%
  combn_join(.) %>%
  do.call(rbind, .) %>%
  filter(!is.na(ID.y)) %>%
  group_by(ID.x, ID.y) %>%
  summarize(matches = n())

Result:

# A tibble: 3 x 3
# Groups:   ID.x [?]
  ID.x  ID.y  matches
  <fct> <fct>   <int>
1 ID1   ID2         2
2 ID1   ID3         1
3 ID2   ID3         2

Data:

df <- structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L), .Label = c("ID1", "ID2", "ID3"), class = "factor"), category = structure(c(2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L), .Label = c("color", "length", 
"type"), class = "factor"), value = structure(c(1L, 4L, 3L, 1L, 
5L, 3L, 2L, 5L, 3L), .Label = c("100", "150", "Blue", "L", "M"
), class = "factor")), .Names = c("ID", "category", "value"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9"))

Upvotes: 0

comparing each row with all other rows in data.frame

Answers (4)

Data

data

Related Questions