Reputation: 854
I have a large data set with month, customer ID and store ID. There is one record per customer, per location, per month summarizing their activity at that location.
Month Customer ID Store
Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C
I'm interested in creating a matrix that shows the number of customers that each location shares with another. Like this:
A B C
A 4 2 2
B 2 4 2
C 2 2 4
For example, since customer visited Store A and then Store B in the next month, they would be added to the tally. I'm interested in number of shared customers, not number of visits.
I tried the sparse matrix approach in this thread(Creating co-occurrence matrix), but the numbers returned don't match up for some reason I cannot understand.
Any ideas would be greatly appreciated!
Upvotes: 1
Views: 339
Reputation: 23101
We can try this too:
library(reshape2)
df <- dcast(df,CustomerID~Store, length, value.var='Store')
# CustomerID A B C
#1 1 1 1 1
#2 2 1 1 0 # Customer 2 went to stores A,B but not to C
#3 3 1 0 1
#4 4 1 0 0
#5 7 0 1 0
#6 11 0 0 1
#7 12 0 1 1
crossprod(as.matrix(df[-1]))
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4
with library arules
:
library(arules)
write(' Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C', 'basket_single')
tr <- read.transactions("basket_single", format = "single", cols = c(2,3))
inspect(tr)
# items transactionID
#[1] {A,B,C} 1
#[2] {C} 11
#[3] {B,C} 12
#[4] {A,B} 2
#[5] {A,C} 3
#[6] {A} 4
#[7] {B} 7
image(tr)
crossTable(tr, sort=TRUE)
# A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4
Upvotes: 1
Reputation: 37641
Update: The original solution that I posted worked for your data. But your data has the unusual property that no customer ever visited the same store in two different months. Presuming that would happen, a modification is needed.
What we need is a matrix of stores by customers that has 1 if the customer ever
visited the store and zero otherwise. The original solution used
M = as.matrix(table(Dat$ID_Store, Dat$Customer))
which gives how many different months the store was visited by each customer. With
different data, these numbers might be more than one. We can fix that by using
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
If you look at this matrix, it will say TRUE and FALSE, but since TRUE=1 and FALSE=0
that will work just fine. So the full corrected solution is:
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
M %*% t(M)
A B C
A 4 2 2
B 2 4 2
C 2 2 4
Upvotes: 3