CAbraham

Reputation: 11

Finding All Pairwise Commonalities in R

I have lurked on Stack Exchange for a long time but could not find this question answered for R, so I created an account just to ask it.

Basically, I have a data set of customers and stores (roughly 260k customers and 300 stores, with most customers visiting at least 10 unique stores), and I want to see which sites share the most customers (i.e., sites A and B share this many customers, A and C that many, and so on for all pairs of sites).

Reproducible example:

begindata<-data.frame(customer_id=c(1,2,3,1,2,3,4,1,4,5), site_visited=c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'D', 'D'))

and I would like to see the following, if possible:

final_table<-data.frame(site_1=c('A', 'A', 'A', 'B', 'B', 'C'), site_2=c('B', 'C', 'D', 'C', 'D', 'D'), number_of_commonalities=c(3, 1,0,1,1,0))

I have tried joining begindata to itself on customer_id, something like this:

library(dplyr)
attempted_df<-begindata %>% left_join(begindata, by="customer_id") %>% count(site_visited.x, site_visited.y)

I know the result is redundant: it contains both orderings of each pair (A, B, 3 and B, A, 3) as well as self-pairs such as A, A, 3.

However, this join cannot be executed on my actual data set (260k customers and 300 sites) due to size limitations.
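For reference, the redundancy in the self-join can be removed by keeping only the rows where the first site sorts before the second. The following is a minimal base R sketch of that idea, not the asker's code: it uses merge() in place of the dplyr join, assumes site_visited is a character column, and deduplicates customer/site rows first. It still builds the full self-join, so it does not address the size problem by itself, and pairs with no shared customers (such as C and D) simply do not appear in the output.

```r
begindata <- data.frame(
  customer_id  = c(1, 2, 3, 1, 2, 3, 4, 1, 4, 5),
  site_visited = c("A", "A", "A", "B", "B", "B", "B", "C", "D", "D"),
  stringsAsFactors = FALSE  # keep sites as character so `<` compares strings
)

# Assumption: repeat visits should count once, so drop duplicate rows first
visits <- unique(begindata)

# Self-join on customer_id: one row per (customer, site.x, site.y) combination
pairs <- merge(visits, visits, by = "customer_id")

# Keep each unordered pair once and drop self-pairs such as A, A
pairs <- pairs[pairs$site_visited.x < pairs$site_visited.y, ]

# Count the shared customers for each remaining pair
pair_counts <- aggregate(customer_id ~ site_visited.x + site_visited.y,
                         data = pairs, FUN = length)
names(pair_counts) <- c("site_1", "site_2", "number_of_commonalities")
pair_counts <- pair_counts[order(pair_counts$site_1, pair_counts$site_2), ]
print(pair_counts)
```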

Any advice would be greatly appreciated! Also go easy on me if my post sucks--I think I have followed the rules and suggestions?

Upvotes: 1

Views: 47

Answers (1)

akrun

Reputation: 887128

We could use combn to enumerate every pair of sites and count the customers common to both:

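# For every pair of sites, count the customers who visited both
number_of_commonalities <- combn(unique(begindata$site_visited), 2,
    FUN = function(x) with(begindata,
        length(intersect(customer_id[site_visited == x[1]],
                         customer_id[site_visited == x[2]]))))

# Label each count with its pair, e.g. "A_B"
names(number_of_commonalities) <- combn(unique(begindata$site_visited), 2,
    FUN = paste, collapse = "_")

# Reshape to a two-column data frame: pair label (ind), then count (values)
stack(number_of_commonalities)[2:1]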
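With around 300 sites, the pairwise loop evaluates roughly 45,000 intersections, each rescanning the full table. A different technique that may scale better (not part of the answer above) is to build a sparse customer-by-site incidence matrix and take its cross-product, so that entry (i, j) is the number of customers shared by sites i and j. The sketch below assumes the Matrix package is available (it ships with standard R installations); the names visits, m, overlap, and result are illustrative.

```r
library(Matrix)

begindata <- data.frame(
  customer_id  = c(1, 2, 3, 1, 2, 3, 4, 1, 4, 5),
  site_visited = c("A", "A", "A", "B", "B", "B", "B", "C", "D", "D"),
  stringsAsFactors = FALSE
)

# Drop duplicate customer/site rows so each visit pattern counts once
visits    <- unique(begindata)
sites     <- sort(unique(visits$site_visited))
customers <- unique(visits$customer_id)

# Sparse 0/1 incidence matrix: rows are customers, columns are sites
m <- sparseMatrix(
  i = match(visits$customer_id, customers),
  j = match(visits$site_visited, sites),
  x = 1,
  dims = c(length(customers), length(sites))
)

# t(m) %*% m: entry (i, j) is the number of customers shared by sites i and j
overlap <- as.matrix(crossprod(m))
dimnames(overlap) <- list(sites, sites)

# Keep the upper triangle only, so each unordered pair appears once
idx <- which(upper.tri(overlap), arr.ind = TRUE)
result <- data.frame(
  site_1 = sites[idx[, "row"]],
  site_2 = sites[idx[, "col"]],
  number_of_commonalities = overlap[idx],
  stringsAsFactors = FALSE
)
result <- result[order(result$site_1, result$site_2), ]
print(result)
```

Unlike a filtered self-join, this keeps pairs with zero shared customers, and the diagonal of overlap holds each site's own customer count.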

Upvotes: 2
