Reputation: 49
My data.frame is below:
group_id user_id
1000 26
1236 29
1236 46
3767 26
3767 46
5614 29
5614 45
5614 46
I need output as below:
User-1 User-2 #of-common-groups
26 26 2
26 46 1
29 29 2
29 45 1
29 46 2
45 29 1
45 45 1
45 46 1
46 26 1
46 29 2
46 45 1
46 46 3
Is there a quick way to achieve this? I actually have 137 distinct groups and about 81000 users.
User 26 belongs to 2 groups, and he also shares group 3767 with user 46. Hence:
26 26 2
26 46 1
46 26 1
46 46 3 (user 46 belongs to 3 groups), etc.
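The expected counts above can be sanity-checked on the sample data with base R alone (just a check of the logic, not necessarily a fast approach for 81000 users):

```r
# Build the sample data and count shared groups via an incidence matrix.
df <- read.table(text = "group_id user_id
1000 26
1236 29
1236 46
3767 26
3767 46
5614 29
5614 45
5614 46", header = TRUE)
m <- table(df$group_id, df$user_id)  # group x user incidence matrix
crossprod(m)                         # user x user counts of shared groups
```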
Upvotes: 2
Views: 178
Reputation: 3878
What about:
df <- read.table(text="group_id user_id
1000 26
1236 29
1236 46
3767 26
3767 46
5614 29
5614 45
5614 46", header=T)
df <- merge(df, df, by = "group_id")[,-1]
library(plyr)
ddply(df,.(user_id.x, user_id.y),nrow)
user_id.x user_id.y V1
1 26 26 2
2 26 46 1
3 29 29 2
4 29 45 1
5 29 46 2
6 45 29 1
7 45 45 1
8 45 46 1
9 46 26 1
10 46 29 2
11 46 45 1
12 46 46 3
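For reference, the same join-and-count can be written with dplyr instead of plyr (an assumption on my part that dplyr is available; this is not part of the original answer):

```r
library(dplyr)
# Self-join on group_id, drop the group column, then count each user pair.
pairs <- merge(df, df, by = "group_id")[, -1]
count(pairs, user_id.x, user_id.y)  # same table, with the count in column n
```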
EDIT:
I was afraid this was too simple. In this case, because of the merge, a "large" number of users and a small number of groups matter. Depending of course on the final use, I would consider a graph structure, as already suggested by user20650, and probably keep it as such. An undirected weighted graph with quick look-up of the vertices (user_id) seems like a good solution in many cases.
I'll leave this simple method here for smaller datasets (or just less overlap).
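A minimal sketch of that graph idea, assuming the igraph package (the name `w` is mine; `df` here is the merged pairs frame from above, and plyr is already loaded):

```r
library(igraph)
# Count shared groups per user pair, keep one edge per unordered pair.
w <- ddply(df, .(user_id.x, user_id.y), nrow)
w <- w[w$user_id.x < w$user_id.y, ]
# Third column (V1) becomes an edge attribute: the shared-group count.
g <- graph.data.frame(w, directed = FALSE)
# Quick look-up: number of groups shared by users 26 and 46.
E(g)[get.edge.ids(g, c("26", "46"))]$V1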
Upvotes: 3
Reputation: 118789
Here's an attempt using the Matrix package - just replicating @nograpes' answer from here:
require(Matrix)
sm = sparseMatrix(dat$group_id, dat$user_id, x = TRUE)
cp = t(sm) %*% sm
as.data.frame(summary(cp))
# i j x
# 1 26 26 2
# 2 46 26 1
# 3 29 29 2
# 4 45 29 1
# 5 46 29 2
# 6 29 45 1
# 7 45 45 1
# 8 46 45 1
# 9 26 46 1
# 10 29 46 2
# 11 45 46 1
# 12 46 46 3
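A small aside (my note, not from the answer): the same product can be computed without forming the transpose explicitly, since Matrix provides its own crossprod method:

```r
# Equivalent to t(sm) %*% sm, but skips the explicit transpose.
cp <- Matrix::crossprod(sm)
```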
Upvotes: 3
Reputation: 25854
# your data
dat <- read.table(text="group_id user_id
1000 26
1236 29
1236 46
3767 26
3767 46
5614 29
5614 45
5614 46", header=T)
# convert to matrix
m <- as.matrix(table(dat))
# calculate and reshape
mm <- crossprod(m, m)
r <- reshape2::melt(mm)
# drop the zero counts
r[r$value != 0, ]
# user_id user_id value
# 1 26 26 2
# 4 46 26 1
# 6 29 29 2
# 7 45 29 1
# 8 46 29 2
# 10 29 45 1
# 11 45 45 1
# 12 46 45 1
# 13 26 46 1
# 14 29 46 2
# 15 45 46 1
# 16 46 46 3
EDIT: idea from Network: Making Graph Object from Event-Node Data Using igraph
library(igraph)
g <- graph.data.frame(dat, directed = FALSE)
V(g)$type <- V(g)$name %in% unique(as.character(dat$group_id))
b <- bipartite.projection(g)$proj1
ad <- get.adjacency(b, sparse=F, attr="weight")
ad <- ad[sort(colnames(ad)), sort(colnames(ad))]
diag(ad) <- colSums(table(dat))
Then continue as before (melt the adjacency matrix and drop the zero counts).
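That final step, spelled out (reshape2 assumed, exactly as in the first solution):

```r
# Melt the weighted adjacency matrix back into pair/count rows
# and keep only the pairs that actually share a group.
r <- reshape2::melt(ad)
r[r$value != 0, ]
```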
Upvotes: 3
Reputation: 59345
So here are two approaches, one using data.table
...
library(data.table)
setkey(setDT(df),group_id)
df[df,allow.cartesian=TRUE][,.N,by=list(user_id,i.user_id)][order(user_id,i.user_id)]
# user_id i.user_id N
# 1: 26 26 2
# 2: 26 46 1
# 3: 29 29 2
# 4: 29 45 1
# 5: 29 46 2
# 6: 45 29 1
# 7: 45 45 1
# 8: 45 46 1
# 9: 46 26 1
# 10: 46 29 2
# 11: 46 45 1
# 12: 46 46 3
and one using sqldf
...
library(sqldf)
sqldf("select a.user_id as user1, b.user_id as user2, count(*) as groups
from df a inner join df b on a.group_id=b.group_id
group by 1,2 order by 1,2")
# user1 user2 groups
# 1 26 26 2
# 2 26 46 1
# 3 29 29 2
# 4 29 45 1
# 5 29 46 2
# 6 45 29 1
# 7 45 45 1
# 8 45 46 1
# 9 46 26 1
# 10 46 29 2
# 11 46 45 1
# 12 46 46 3
The data.table approach is likely to be faster, but your dataset is not very large, so it might not make much of a difference.
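If it matters, that speed claim is easy to check on the real data (assumes the microbenchmark package and the keyed data.table `df` set up above):

```r
library(microbenchmark)
# Time both approaches a few times on the same df.
microbenchmark(
  data.table = df[df, allow.cartesian = TRUE][, .N,
                  by = list(user_id, i.user_id)],
  sqldf = sqldf("select a.user_id user1, b.user_id user2, count(*) n
                 from df a inner join df b on a.group_id = b.group_id
                 group by 1,2"),
  times = 10
)
```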
Upvotes: 2