Reputation: 6743
I am trying to create a co-occurrence data set in which the variable of interest is a software application. I want to simulate an n-by-n matrix where each cell holds the number of times application A was used with application B. How can I create such a data set in R to test a set of clustering and partitioning algorithms? What model would I use, and how would I generate the data in R?
Upvotes: 1
Views: 598
Reputation: 93803
set.seed(42)
# software names:
software <- c("a","b","c","d")
# times each software used:
times.each.sw <- c(5,10,12,3)
# co-occurrence data.frame
swdf <- setNames(data.frame(t(combn(software,2))),c("sw1","sw2"))
swdf$freq.cooc <- apply(combn(times.each.sw,2),2,function(x) sample(1:min(x),1) )
# sw1 sw2 freq.cooc
#1 a b 5
#2 a c 5
#3 a d 1
#4 b c 9
#5 b d 2
#6 c d 2
If you prefer a co-occurrence matrix, something like this should work:
mat <- diag(times.each.sw)
dimnames(mat) <- list(software,software)
mat[lower.tri(mat)] <- swdf$freq.cooc
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
# a b c d
#a 5 5 5 1
#b 5 10 9 2
#c 5 9 12 2
#d 1 2 2 3
The diagonal contains the number of times each software was used (i.e. used with itself). The lower and upper triangles contain the number of times each pair was used together, which must always be less than or equal to the usage count of the less frequently used software in the pair.
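The constraints described above (usage counts on the diagonal, pair counts bounded by the smaller of the two usage counts) also hold automatically if the matrix is derived from simulated usage sessions rather than sampled cell by cell. The following is only a sketch of that idea, not part of the approach above: the number of sessions, the per-session usage probabilities, and the names `n_sessions`, `p_use`, `usage` and `cooc` are all assumptions made for illustration, and apps are drawn independently, which real usage data would likely not satisfy.

```r
set.seed(42)

apps <- c("a", "b", "c", "d")
n_sessions <- 20                    # hypothetical number of usage sessions
p_use <- c(0.25, 0.5, 0.6, 0.15)    # hypothetical per-session usage probabilities

# sessions x apps incidence matrix: 1 if the app was used in that session.
# rep(..., each = n_sessions) lines the probabilities up with the columns,
# because matrix() fills column by column.
usage <- matrix(rbinom(n_sessions * length(apps), size = 1,
                       prob = rep(p_use, each = n_sessions)),
                nrow = n_sessions, dimnames = list(NULL, apps))

# co-occurrence counts: entry [i, j] = number of sessions that used both i and j;
# the diagonal is the number of sessions in which each app was used at all
cooc <- crossprod(usage)
cooc
```

Because every count comes from the same set of sessions, `cooc` is symmetric by construction. To get data with cluster structure, one option would be to vary `p_use` across groups of sessions.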
Upvotes: 1
Reputation: 59335
n <- 10
apps <- LETTERS[1:n]
data <- matrix(0,n,n)
rownames(data) <- apps
colnames(data) <- apps
# create artificial clusters
data[1:3,1:5] <- matrix(sample(3:5,15,replace=TRUE),3,5)
data[6:9,4:8] <- matrix(sample(1:3,20,replace=TRUE),4,5)
# make the matrix symmetric: A used with B must equal B used with A
data <- data + t(data)
# clustering
hc <- hclust(dist(data))
plot(hc)
rect.hclust(hc, k=2)
Note: This answer has been edited to reflect the fact that the co-occurrence matrix must be symmetric.
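The symmetry requirement in the note can be enforced and checked explicitly. The sketch below is self-contained and uses hypothetical names (`raw`, `sym`, `groups`) and made-up counts. It also shows one possible alternative to `dist()`: treating the counts as similarities and converting them directly to dissimilarities, on the assumption that apps used together more often should end up closer together.

```r
set.seed(1)

n <- 6
apps <- LETTERS[1:n]

# hypothetical raw counts, not yet symmetric
raw <- matrix(sample(0:5, n * n, replace = TRUE), n, n,
              dimnames = list(apps, apps))

# symmetrize by adding the transpose: entry [A, B] now equals entry [B, A]
sym <- raw + t(raw)
stopifnot(isSymmetric(sym))

# convert counts (similarities) to dissimilarities: frequent pairs become close.
# as.dist() uses only the lower triangle and ignores the diagonal.
d <- as.dist(max(sym) - sym)

hc <- hclust(d)
groups <- cutree(hc, k = 2)   # cluster label for each app
groups
```

Whether `dist()` on the rows or a direct conversion like this is more appropriate depends on what "similar" should mean for the algorithms being tested.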
Upvotes: 1