Super_John
Super_John

Reputation: 1897

Creating a Cross-frequency table in R or MySQL

I have a table of user_id - category pairs. Users can fall into multiple categories. I'm am try to get counts cross-category for each possible outcome. i.e. number of users who were in category A and also category C, etc.

My raw data is structured like this:

example results

I would like results that look like this, showing counts cross-category:

example results

How can this be accomplished in R or MySQL? The data is quite large.

Here's are sample data:

data <- structure(list(category = structure(c(1L, 2L, 2L, 1L, 3L, 3L, 
2L, 1L, 3L, 2L, 2L, 2L, 3L, 1L, 1L, 3L), .Label = c("A", "B", 
"C"), class = "factor"), user_id = c(464L, 345L, 342L, 312L, 
345L, 234L, 423L, 464L, 756L, 756L, 345L, 345L, 464L, 345L, 234L, 
312L)), .Names = c("category", "user_id"), class = "data.frame", row.names = c(NA, 
-16L))

Any code snippets, thoughts on approach, functions, or package recommendations would be appreciated. Thank you! -John

Upvotes: 5

Views: 349

Answers (4)

Alex Woolford
Alex Woolford

Reputation: 4563

You could use dplyr to create a list of all the unique pairs, and the crossprod to count the number of users that are common to a pair of categories.

> library(dplyr)
> data <- data %>% group_by(user_id, category) %>% summarize(records = sign(n()))
> crossprod(table(data$user_id, data$category))

    A B C
  A 4 1 4
  B 1 4 2
  C 4 2 5

Upvotes: 0

Forrest R. Stevens
Forrest R. Stevens

Reputation: 3495

Based on the sample data provided I actually don't think the R solution provided by @josilber is correct, though I could be wrong given the lack of the example solution desired. I think you could do this with igraph and its bipartite network representation of the data, but this could be inefficient on larger sets of data/categories. As an alternative a relatively efficient calculation in R using a sparse matrix representation of the data can be done like this:

library('Matrix')
mat <- spMatrix(nrow=length(unique(data$category)),
    ncol=length(unique(data$user_id)),
    i = as.numeric(factor(data$category)),
    j = as.numeric(factor(data$user_id)),
    x = rep(1, length(as.numeric(data$category)))
)
rownames(mat) <- levels(factor(data$category))
colnames(mat) <- levels(factor(data$user_id))
mat

#mat_row <- mat %*% t(mat)

##  Based on @user20650's comment this is even more efficient than
##    the multiplication above:
mat_row <- tcrossprod(mat)

This I think yields the following correct output based on the sample data above:

> mat_row
3 x 3 sparse Matrix of class "dgCMatrix"
  A  B C
A 7  3 5
B 3 12 4
C 5  4 5

Upvotes: 1

josliber
josliber

Reputation: 44340

In R, I would approach this by first splitting up the data by user, computing all the unique pairs of categories for that user and then grouping together:

data$category <- as.character(data$category)
(combos <- do.call(rbind, tapply(data$category, data$user_id, function(x) {
  u <- unique(x)
  if (length(u) > 1) t(combn(u, 2))
  else NULL
})))
#      [,1] [,2]
# [1,] "C"  "A" 
# [2,] "A"  "C" 
# [3,] "B"  "C" 
# [4,] "B"  "A" 
# [5,] "C"  "A" 
# [6,] "A"  "C" 
# [7,] "C"  "B" 

The last step is to tabulate the pairs, which can be done with the table function in R. We'll actually use table twice to capture (a, b) and (b, a) for each pairing of categories a and b:

table(combos[,1], combos[,2]) + table(combos[,2], combos[,1])
#     A B C
#   A 0 1 4
#   B 1 0 2
#   C 4 2 0

Upvotes: 2

Gordon Linoff
Gordon Linoff

Reputation: 1271111

In MySQL, you can readily do this in a three-column format:

select a.category, b.category, count(*)
from pairs a join
     pairs b
     on a.user_id = b.user_id
group by a.category, b.category;

Producing a table as a matrix is challenging in SQL, unless you know all the column names. Otherwise, you need a dynamic pivot (google: "mysql dynamic pivot"). For most purposes in the database, the three column format is preferable.

Upvotes: 0

Related Questions