Reputation: 110
I have a tibble, each row corresponds a sample with ID, each sample will have multiple category with values. I want to list the pairs of IDs which the value of two or more category matches to one another.
# A tibble: 3 x 2
ID data
<chr> <list>
1 ID1 <tibble [1 x 3]>
2 ID2 <tibble [1 x 3]>
3 ID3 <tibble [1 x 3]>
I can use sum(match(x$data[[i]], x$data[[j]], nomatch=0)>0)
to get the matching counts between two rows.
I can get the answer by putting the match()
inside a i
and j
loops. Is there any better, i.e. tidy way (using tidyverse
) to get the answer?
Thanks.
For example, here is the original data.frame:
ID category value
1 ID1 length 100
2 ID1 type L
3 ID1 color Blue
4 ID2 length 100
5 ID2 type M
6 ID2 color Blue
7 ID3 length 150
8 ID3 type M
9 ID3 color Blue
The output will look like:
ID.a ID.b matches
1 ID1 ID2 2
2 ID1 ID3 1
3 ID2 ID3 2
I used tidyverse
to transform the input data.frame to tibble as shown in the beginning, then using match
in a loop.
# A tibble: 3 x 2
ID data
<chr> <list>
1 ID1 <tibble [1 x 3]>
2 ID2 <tibble [1 x 3]>
3 ID3 <tibble [1 x 3]>
Upvotes: 3
Views: 1568
Reputation: 42544
For the sake of completeness, here is also a solution which uses a self-join:
library(data.table)
setDT(x)[x, on = .(category, value), allow = TRUE][
ID < i.ID, .N, by = .(ID1 = ID, ID2 = i.ID)]
ID1 ID2 N 1: ID1 ID2 2 2: ID2 ID3 2 3: ID1 ID3 1
x <- readr::read_table(
"i ID category value
1 ID1 length 100
2 ID1 type L
3 ID1 color Blue
4 ID2 length 100
5 ID2 type M
6 ID2 color Blue
7 ID3 length 150
8 ID3 type M
9 ID3 color Blue")[, -1L]
Upvotes: 1
Reputation: 887078
Here is an option using base R
by making use of table
and crossprod
. Set the lower triangular values of the matrix output of crossprod
to NA
, convert it to 'long' format by converting to data.frame
and then subset
the rows that are non-NA for 'Freq' column
out <- with(df, crossprod(table(paste(category, value), ID)))
out[lower.tri(out, diag = TRUE)] <- NA
subset(as.data.frame.table(out), !is.na(Freq))
# ID ID.1 Freq
#4 ID1 ID2 2
#7 ID1 ID3 1
#8 ID2 ID3 2
df <- structure(list(ID = c("ID1", "ID1", "ID1", "ID2", "ID2", "ID2",
"ID3", "ID3", "ID3"), category = c("length", "type", "color",
"length", "type", "color", "length", "type", "color"),
value = c("100",
"L", "Blue", "100", "M", "Blue", "150", "M", "Blue")),
class = "data.frame", row.names = c(NA, -9L))
Upvotes: 2
Reputation: 16121
df = read.table(text="
ID category value
ID1 length 100
ID1 type L
ID1 color Blue
ID2 length 100
ID2 type M
ID2 color Blue
ID3 length 150
ID3 type M
ID3 color Blue
", header=T, stringsAsFactors = F)
library(tidyverse)
# create a new column that combines category and value
df = df %>% unite(cat_val, category, value, remove = F)
# create vectorised function that counts matches (on that new value)
f = function(x,y) sum(df$cat_val[df$ID == x] == df$cat_val[df$ID == y])
f = Vectorize(f)
data.frame(t(combn(unique(df$ID), 2))) %>% # create combinations of IDs (as a dataframe)
mutate(matches = f(X1, X2)) # apply function
# X1 X2 matches
# 1 ID1 ID2 2
# 2 ID1 ID3 1
# 3 ID2 ID3 2
Upvotes: 1
Reputation: 18681
Not sure if this is tidier, but we can do something like the following:
library(tidyverse)
combn_join <- function(x) {
map2(combn(1:3, 2)[1,], combn(1:3, 2)[2,],
~ left_join(x[[.x]], x[[.y]], by = c("category", "value")) %>%
select(ID.x, ID.y))
}
df %>%
split(.$ID) %>%
combn_join(.) %>%
do.call(rbind, .) %>%
filter(!is.na(ID.y)) %>%
group_by(ID.x, ID.y) %>%
summarize(matches = n())
Result:
# A tibble: 3 x 3
# Groups: ID.x [?]
ID.x ID.y matches
<fct> <fct> <int>
1 ID1 ID2 2
2 ID1 ID3 1
3 ID2 ID3 2
Data:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L), .Label = c("ID1", "ID2", "ID3"), class = "factor"), category = structure(c(2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L), .Label = c("color", "length",
"type"), class = "factor"), value = structure(c(1L, 4L, 3L, 1L,
5L, 3L, 2L, 5L, 3L), .Label = c("100", "150", "Blue", "L", "M"
), class = "factor")), .Names = c("ID", "category", "value"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))
Upvotes: 0