Reputation: 229
I am trying to implement the Apriori algorithm in R. First I want to count the frequency of each item in the list. My initial code is below:
a_list <- list(c("I1","I2","I5"),
c("I2","I4"),
c("I2","I3"),
c("I1","I2","I4"),
c("I1","I3"),
c("I2","I3"),
c("I1","I3"),
c("I1","I2","I3","I5"),
c("I1","I2","I3"))
sapply(a_list, function(x) length(x))
un <- unique(unlist(a_list))
nm <- lapply(un, function(x) sapply(a_list, function(y) sum(y == x)))
names(nm) <- un
nm
I have the result as:
> nm
$I1
[1] 1 0 0 1 1 0 1 1 1
$I2
[1] 1 1 1 1 0 1 0 1 1
$I5
[1] 1 0 0 0 0 0 0 1 0
$I4
[1] 0 1 0 1 0 0 0 0 0
$I3
[1] 0 0 1 0 1 1 1 1 1
However, I want it arranged as follows (maybe reshaped into a matrix or array, so that I can process it further):
> nm
I1 6
I2 7
I3 6
I4 2
I5 2
Each item is shown with its frequency count, in alphabetical order. Is there any way to implement this? I tried cbind, apply, and relist, but haven't found a solution yet. Thanks
UPDATE:
library(dplyr)
a_list <- list(c("I1","I2","I5"),
c("I2","I4"),
c("I2","I3"),
c("I1","I2","I4"),
c("I1","I3"),
c("I2","I3"),
c("I1","I3"),
c("I1","I2","I3","I5"),
c("I1","I2","I3"))
a <- unlist(a_list) %>% table %>% data.frame
a
minsupport <- 3
b <- data.frame(a)
# keep items whose count is no less than the minimum support
c <- b[b$Freq >= minsupport,]
c
Now I have the result:
> a
. Freq
1 I1 6
2 I2 7
3 I3 6
4 I4 2
5 I5 2
> c
. Freq
1 I1 6
2 I2 7
3 I3 6
How can I then build the combinations "I1,I2", ..., "I2,I3" by scanning the original list?
UPDATE: I tried combn as below; it outputs a matrix.
> combn(c$.,2)
[,1] [,2] [,3]
[1,] I1 I1 I2
[2,] I2 I3 I3
Levels: I1 I2 I3 I4 I5
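I modified it further to:
# convert the factor to character so combn returns a plain character matrix
d <- combn(as.character(c$.),2)
# paste each column (one pair per column) into a single "A,B" string
result <- apply(d,2,paste,collapse=",")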
result
My result is:
> result
[1] "I1,I2" "I1,I3" "I2,I3"
The next step is to count the frequency of the above itemsets in the original "a_list". Maybe it is better to output them as
""I1","I2"", ""I1","I3"", ""I2","I3""
in order to compare with the original list.
How can I get the frequency of each itemset in this matrix from the original a_list? The Apriori algorithm requires finding every itemset whose support is no less than the minimum support, starting from 1-itemsets (i.e. "I1", "I2", ..., "I5" in a_list), moving to 2-itemsets (i.e. "I1,I2" "I1,I3" "I2,I3" in this case), and continuing to larger itemsets where applicable (e.g. "I1,I2,I3").
UPDATE: Now I can find the match for a specific pattern, e.g. ("I1","I2") or ("I1","I3"), individually.
toMatch <- c("I1","I2")
matches <- grepRaw(toMatch,a_list,ignore.case = TRUE)
matches
Results:
> matches
[1] 4
What remains is to match all patterns in "result" at once (I entered the pattern manually in the example above, but it needs to be extracted from "result"), and to output them in this form:
Itemset Freq
""I1","I2"" 4
""I1","I3"" 4
""I2","I3"" 4
Upvotes: 0
Views: 1263
Reputation: 356
The dplyr package makes this operation clear.
library(dplyr)
unlist(a_list) %>% table %>% data.frame
unlist.a_list. Freq
1 I1 6
2 I2 7
3 I3 6
4 I4 2
5 I5 2
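If you would rather not load a package, the same counts can be sketched in base R. This is only a minimal illustration, assuming the a_list from the question; the name freq_vec is my own and not part of the original code.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))

# table() counts each distinct item and orders the names alphabetically
freq <- table(unlist(a_list))

# freq_vec (hypothetical name): a plain named vector, item -> count
freq_vec <- setNames(as.vector(freq), names(freq))
print(freq_vec)
```

A named vector like this can be indexed directly, e.g. freq_vec["I2"], which may be handier than a data frame for later lookups.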
UPDATE:
I'm not sure exactly what you're looking for, but here is how to get the combinations:
Cols <- paste0("I",1:3)
p <- length(Cols)
# for each size i = 1..p, list every i-element combination of column indices
id <- unlist(lapply(seq_len(p), function(i) combn(seq_len(p),i,simplify=FALSE)), recursive=FALSE)
formulas <- sapply(id,function(i) paste(Cols[i],collapse=","))
> formulas
[1] "I1" "I2" "I3" "I1,I2" "I1,I3" "I2,I3" "I1,I2,I3"
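The enumeration above lists every subset regardless of how often it occurs. The level-wise search described in the question instead grows itemsets one size at a time and keeps only the frequent ones. Here is a rough sketch of that idea, not a tuned implementation. It assumes an itemset is frequent when its count is at least minsupport, and the helpers support_count and key are names I made up for illustration.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))
minsupport <- 3  # assumption: frequent means count >= minsupport

# support_count (hypothetical helper): transactions containing every item
support_count <- function(items, transactions) {
  sum(vapply(transactions, function(tr) all(items %in% tr), logical(1)))
}

# key (hypothetical helper): comma-separated label for an itemset
key <- function(items) paste(items, collapse = ",")

frequent <- integer(0)                               # named vector: itemset -> count
candidates <- as.list(sort(unique(unlist(a_list))))  # level 1: single items
k <- 1
while (length(candidates) > 0) {
  counts <- vapply(candidates, support_count, integer(1), transactions = a_list)
  keep <- counts >= minsupport
  level <- candidates[keep]
  if (length(level) == 0) break
  level_keys <- vapply(level, key, character(1))
  frequent <- c(frequent, setNames(counts[keep], level_keys))

  # Next level: combine the surviving items into itemsets of size k + 1
  k <- k + 1
  items <- sort(unique(unlist(level)))
  if (length(items) < k) break
  candidates <- combn(items, k, simplify = FALSE)
  # Prune: drop a candidate if any of its (k - 1)-item subsets is not frequent
  candidates <- Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subsets, key, character(1)) %in% level_keys)
  }, candidates)
}
print(frequent)
```

With this data the loop stops at size 3, because "I1,I2,I3" appears in only two transactions.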
UPDATE 2:
This should do what you need:
library(dplyr)
a_list <- list(c("I1","I2","I5"),
c("I2","I4"),
c("I2","I3"),
c("I1","I2","I4"),
c("I1","I3"),
c("I2","I3"),
c("I1","I3"),
c("I1","I2","I3","I5"),
c("I1","I2","I3"))
a <- unlist(a_list) %>% table %>% data.frame
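minsupport <- 3
b <- data.frame(a)
# keep items whose count is no less than the minimum support
c <- b[b$Freq >= minsupport,]
# convert the factor to character so combn returns a plain character matrix
d <- combn(as.character(c$.),2)
# paste each column (one pair per column) into a single "A,B" string
result <- apply(d,2,paste,collapse=",")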
> result
[1] "I1,I2" "I1,I3" "I2,I3"
Then collapse your a_list to look like result:
a.new.list <- sapply(a_list, paste, collapse=",")
> a.new.list
[1] "I1,I2,I5" "I2,I4" "I2,I3" "I1,I2,I4" "I1,I3" "I2,I3" "I1,I3"
[8] "I1,I2,I3,I5" "I1,I2,I3"
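Use the match function and loop over all results. Note that match compares whole strings, so a transaction only counts as a hit when it is identical to the itemset:
hits <- sapply(seq_along(result), function(j) match(a.new.list,result[j]))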
colnames(hits) <- result
rownames(hits) <- a.new.list
> hits
I1,I2 I1,I3 I2,I3
I1,I2,I5 NA NA NA
I2,I4 NA NA NA
I2,I3 NA NA 1
I1,I2,I4 NA NA NA
I1,I3 NA 1 NA
I2,I3 NA NA 1
I1,I3 NA 1 NA
I1,I2,I3,I5 NA NA NA
I1,I2,I3 NA NA NA
> apply(hits,2, sum, na.rm=TRUE)
I1,I2 I1,I3 I2,I3
0 2 2
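The counts above only cover transactions that are exactly equal to an itemset. If what you need is the number of transactions that contain the itemset (which is what the Itemset/Freq table in the question seems to ask for), a subset test may be closer. The following is a minimal sketch under that assumption; itemsets, support and freq_table are illustrative names of mine.

```r
a_list <- list(c("I1","I2","I5"), c("I2","I4"), c("I2","I3"),
               c("I1","I2","I4"), c("I1","I3"), c("I2","I3"),
               c("I1","I3"), c("I1","I2","I3","I5"), c("I1","I2","I3"))
result <- c("I1,I2", "I1,I3", "I2,I3")

# Split each "A,B" string back into a character vector of items
itemsets <- strsplit(result, ",", fixed = TRUE)

# For each itemset, count the transactions that contain all of its items
support <- sapply(itemsets, function(items) {
  sum(sapply(a_list, function(tr) all(items %in% tr)))
})

freq_table <- data.frame(Itemset = result, Freq = support,
                         stringsAsFactors = FALSE)
print(freq_table)
```

Because the test is all(items %in% tr), the order of items inside a transaction does not matter, and the same code works for itemsets of any size.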
Upvotes: 1