Reputation: 19
I have a huge database of 180 columns and 200,000 rows. To illustrate in a better way, I have a matrix of 180 x 200000. Each matrix is a single digit number. I need to find their co-occurrence count. For example I have a data of 5 columns having values 1,2,3,4,5. I need to find the number of times (1,2),(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,4),(3,5),(4,5) have occurred in the database. Can you please suggest me an approach to this problem? I have an exposure to R and python. So any suggestion using those will really help. Can this also be done using AWS map reducer? Any help or pointers on those lines would also be helpful.
Upvotes: 1
Views: 195
Reputation: 1246
Counts of pairs are just products of counts of singletons.
This takes 5 seconds on my year old MacBook Pro using R:
Generate a matrix of 200000 rows and 180 columns whose elements are digits:
mat <- matrix(sample(0:9,180*200000,repl=T),nc=180)
Now table digits in each row:
tab <- sapply( 0:9, function(x) rowSums( mat==x ))
Now find the pair counts in each row:
cp <- combn( 0:9, 2, function(x) tab[,1+x[1] ] * tab[,1+x[2] ])
Sum the rows:
colSums(cp)
Verify the result for the first row:
tab2 <- table( matrix(mat[1,], nr=180, nc=180), matrix(mat[1,], nr=180, nc=180, byrow=TRUE))
all( tab2[ lower.tri(tab2)] == cp[1,] )
Upvotes: 2