Reputation: 49
My data.frame is below:
group_id user_id
1000 26
1236 29
1236 46
3767 26
3767 46
5614 29
5614 45
5614 46
I need output as below:
User-1 User-2 #of-common-groups
26 26 2
26 46 1
29 29 2
29 45 1
29 46 2
45 29 1
45 45 1
45 46 1
46 26 1
46 29 2
46 45 1
46 46 3
Is there a quick way to achieve this? I actually have 137 distinct groups and about 81000 users.
User 26 belongs to 2 groups, and he also shares group 3767 with user 46. Hence:
26 26 2
26 46 1
46 26 1
46 46 3 (user 46 belongs to 3 groups), etc.
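The expected counts above can be sanity-checked on the sample data with base R alone (just a check of the logic, not necessarily a fast approach for 81000 users):

```r
# Build the sample data and count shared groups via an incidence matrix.
df <- read.table(text = "group_id user_id
1000 26
1236 29
1236 46
3767 26
3767 46
5614 29
5614 45
5614 46", header = TRUE)
m <- table(df$group_id, df$user_id)  # group x user incidence matrix
crossprod(m)                         # user x user counts of shared groups
```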
Upvotes: 2
Views: 178
Reputation: 3878
What about:
df <- read.table(text="group_id user_id
1000 26
1236 29
1236 46
3767 26
3767 46
5614 29
5614 45
5614 46", header=T)
df <- merge(df, df, by = "group_id")[,-1]
library(plyr)
ddply(df,.(user_id.x, user_id.y),nrow)
user_id.x user_id.y V1
1 26 26 2
2 26 46 1
3 29 29 2
4 29 45 1
5 29 46 2
6 45 29 1
7 45 45 1
8 45 46 1
9 46 26 1
10 46 29 2
11 46 45 1
12 46 46 3
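For reference, the same join-and-count can be written with dplyr instead of plyr (an assumption on my part that dplyr is available; this is not part of the original answer):

```r
library(dplyr)
# Self-join on group_id, drop the group column, then count each user pair.
pairs <- merge(df, df, by = "group_id")[, -1]
count(pairs, user_id.x, user_id.y)  # same table, with the count in column n
```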
EDIT:
I was afraid this was too simple. In this case, because of the merge, a "large" number of users and a small number of groups matter. Depending of course on the final use, I would consider a graph structure, as already suggested by user20650, and probably keep it as such. An undirected weighted graph with quick look-up of the vertices (user_id) seems like a good solution in many cases.
I'll leave this simple method here for smaller datasets (or just less overlap).
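A minimal sketch of that graph idea, assuming the igraph package (the name `w` is mine; `df` here is the merged pairs frame from above, and plyr is already loaded):

```r
library(igraph)
# Count shared groups per user pair, keep one edge per unordered pair.
w <- ddply(df, .(user_id.x, user_id.y), nrow)
w <- w[w$user_id.x < w$user_id.y, ]
# Third column (V1) becomes an edge attribute: the shared-group count.
g <- graph.data.frame(w, directed = FALSE)
# Quick look-up: number of groups shared by users 26 and 46.
E(g)[get.edge.ids(g, c("26", "46"))]$V1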
Upvotes: 3
Reputation: 118789
Here's an attempt using the Matrix package - just replicating @nograpes' answer from here:
require(Matrix)
sm = sparseMatrix(dat$group_id, dat$user_id, x = TRUE)
cp = t(sm) %*% sm
as.data.frame(summary(cp))
# i j x
# 1 26 26 2
# 2 46 26 1
# 3 29 29 2
# 4 45 29 1
# 5 46 29 2
# 6 29 45 1
# 7 45 45 1
# 8 46 45 1
# 9 26 46 1
# 10 29 46 2
# 11 45 46 1
# 12 46 46 3
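A small aside (my note, not from the answer): the same product can be computed without forming the transpose explicitly, since Matrix provides its own crossprod method:

```r
# Equivalent to t(sm) %*% sm, but skips the explicit transpose.
cp <- Matrix::crossprod(sm)
```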
Upvotes: 3
Reputation: 25854
# your data
dat <- read.table(text="group_id user_id
1000 26
1236 29
1236 46
3767 26
3767 46
5614 29
5614 45
5614 46", header=T)
# convert to matrix
m <- as.matrix(table(dat))
# calculate and reshape
mm <- crossprod(m, m)
r <- reshape2::melt(mm)
# drop the zero counts
r[r$value != 0, ]
# user_id user_id value
# 1 26 26 2
# 4 46 26 1
# 6 29 29 2
# 7 45 29 1
# 8 46 29 2
# 10 29 45 1
# 11 45 45 1
# 12 46 45 1
# 13 26 46 1
# 14 29 46 2
# 15 45 46 1
# 16 46 46 3
EDIT: idea from Network: Making Graph Object from Event-Node Data Using igraph
library(igraph)
g <- graph.data.frame(dat, directed = FALSE)
V(g)$type <- V(g)$name %in% unique(as.character(dat$group_id))
b <- bipartite.projection(g)$proj1
ad <- get.adjacency(b, sparse=F, attr="weight")
ad <- ad[sort(colnames(ad)), sort(colnames(ad))]
diag(ad) <- colSums(table(dat))
Then continue as before (melt the adjacency matrix and drop the zero counts).
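That final step, spelled out (reshape2 assumed, exactly as in the first solution):

```r
# Melt the weighted adjacency matrix back into pair/count rows
# and keep only the pairs that actually share a group.
r <- reshape2::melt(ad)
r[r$value != 0, ]
```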
Upvotes: 3
Reputation: 59345
So here are two approaches, one using data.table
...
library(data.table)
setkey(setDT(df),group_id)
df[df,allow.cartesian=TRUE][,.N,by=list(user_id,i.user_id)][order(user_id,i.user_id)]
# user_id i.user_id N
# 1: 26 26 2
# 2: 26 46 1
# 3: 29 29 2
# 4: 29 45 1
# 5: 29 46 2
# 6: 45 29 1
# 7: 45 45 1
# 8: 45 46 1
# 9: 46 26 1
# 10: 46 29 2
# 11: 46 45 1
# 12: 46 46 3
and one using sqldf
...
library(sqldf)
sqldf("select a.user_id as user1, b.user_id as user2, count(*) as groups
from df a inner join df b on a.group_id=b.group_id
group by 1,2 order by 1,2")
# user1 user2 groups
# 1 26 26 2
# 2 26 46 1
# 3 29 29 2
# 4 29 45 1
# 5 29 46 2
# 6 45 29 1
# 7 45 45 1
# 8 45 46 1
# 9 46 26 1
# 10 46 29 2
# 11 46 45 1
# 12 46 46 3
The data.table approach is likely to be faster, but your dataset is not very large, so it might not make much of a difference.
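If it matters, that speed claim is easy to check on the real data (assumes the microbenchmark package and the keyed data.table `df` set up above):

```r
library(microbenchmark)
# Time both approaches a few times on the same df.
microbenchmark(
  data.table = df[df, allow.cartesian = TRUE][, .N,
                  by = list(user_id, i.user_id)],
  sqldf = sqldf("select a.user_id user1, b.user_id user2, count(*) n
                 from df a inner join df b on a.group_id = b.group_id
                 group by 1,2"),
  times = 10
)
```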
Upvotes: 2