Reputation: 1897
I have a table of user_id - category pairs. Users can fall into multiple categories. I'm am try to get counts cross-category for each possible outcome. i.e. number of users who were in category A and also category C, etc.
My raw data is structured like this:
I would like results that look like this, showing counts cross-category:
How can this be accomplished in R or MySQL? The data is quite large.
Here's are sample data:
data <- structure(list(category = structure(c(1L, 2L, 2L, 1L, 3L, 3L,
2L, 1L, 3L, 2L, 2L, 2L, 3L, 1L, 1L, 3L), .Label = c("A", "B",
"C"), class = "factor"), user_id = c(464L, 345L, 342L, 312L,
345L, 234L, 423L, 464L, 756L, 756L, 345L, 345L, 464L, 345L, 234L,
312L)), .Names = c("category", "user_id"), class = "data.frame", row.names = c(NA,
-16L))
Any code snippets, thoughts on approach, functions, or package recommendations would be appreciated. Thank you! -John
Upvotes: 5
Views: 349
Reputation: 4563
You could use dplyr
to create a list of all the unique pairs, and the crossprod
to count the number of users that are common to a pair of categories.
> library(dplyr)
> data <- data %>% group_by(user_id, category) %>% summarize(records = sign(n()))
> crossprod(table(data$user_id, data$category))
A B C
A 4 1 4
B 1 4 2
C 4 2 5
Upvotes: 0
Reputation: 3495
Based on the sample data provided I actually don't think the R solution provided by @josilber is correct, though I could be wrong given the lack of the example solution desired. I think you could do this with igraph
and its bipartite network representation of the data, but this could be inefficient on larger sets of data/categories. As an alternative a relatively efficient calculation in R using a sparse matrix representation of the data can be done like this:
library('Matrix')
mat <- spMatrix(nrow=length(unique(data$category)),
ncol=length(unique(data$user_id)),
i = as.numeric(factor(data$category)),
j = as.numeric(factor(data$user_id)),
x = rep(1, length(as.numeric(data$category)))
)
rownames(mat) <- levels(factor(data$category))
colnames(mat) <- levels(factor(data$user_id))
mat
#mat_row <- mat %*% t(mat)
## Based on @user20650's comment this is even more efficient than
## the multiplication above:
mat_row <- tcrossprod(mat)
This I think yields the following correct output based on the sample data above:
> mat_row
3 x 3 sparse Matrix of class "dgCMatrix"
A B C
A 7 3 5
B 3 12 4
C 5 4 5
Upvotes: 1
Reputation: 44340
In R, I would approach this by first splitting up the data by user, computing all the unique pairs of categories for that user and then grouping together:
data$category <- as.character(data$category)
(combos <- do.call(rbind, tapply(data$category, data$user_id, function(x) {
u <- unique(x)
if (length(u) > 1) t(combn(u, 2))
else NULL
})))
# [,1] [,2]
# [1,] "C" "A"
# [2,] "A" "C"
# [3,] "B" "C"
# [4,] "B" "A"
# [5,] "C" "A"
# [6,] "A" "C"
# [7,] "C" "B"
The last step is to tabulate the pairs, which can be done with the table
function in R. We'll actually use table
twice to capture (a, b) and (b, a) for each pairing of categories a and b:
table(combos[,1], combos[,2]) + table(combos[,2], combos[,1])
# A B C
# A 0 1 4
# B 1 0 2
# C 4 2 0
Upvotes: 2
Reputation: 1271111
In MySQL, you can readily do this in a three-column format:
select a.category, b.category, count(*)
from pairs a join
pairs b
on a.user_id = b.user_id
group by a.category, b.category;
Producing a table as a matrix is challenging in SQL, unless you know all the column names. Otherwise, you need a dynamic pivot (google: "mysql dynamic pivot"). For most purposes in the database, the three column format is preferable.
Upvotes: 0