g_puffo

Reputation: 623

Matching and Counting Strings of Characters in R

I have a character vector whose entries are combinations of the four letters J, K, Q, Z. Each entry has at least two and at most four letters. For example: data <- c("QK", "KQ", "JKQZ", "KJZ").

I would like to count how many times each entry occurs, without distinguishing between strings made up of the same letters in a different order. I know table(data) doesn't do this, since it treats QK and KQ as different entries and returns

data
JKQZ  KJZ   KQ   QK 
   1    1    1    1 

I have been looking at pmatch and charmatch, but neither seems to do what I want.

EDIT: I should clarify that no entry repeats a letter, so entries such as ZZ or KZK cannot occur.

Upvotes: 4

Views: 161

Answers (2)

Neal Fultz

Reputation: 9687

I would first make a table per observation (set as a factor to get the zero cells), then hash each table and count that:

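library(magrittr)  # provides %>%
library(digest)    # provides digest()
data <- c("QK", "KQ", "JKQZ", "KJZ")
# Split each string into letters, tabulate against fixed levels, stack as rows
tbl <- strsplit(data, "") %>%
  lapply(factor, levels = c("K", "Q", "J", "Z")) %>%
  lapply(table) %>%
  do.call(what = rbind)
tbl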

which gives this:

     K Q J Z
[1,] 1 1 0 0
[2,] 1 1 0 0
[3,] 1 1 1 1
[4,] 1 0 1 1

Then hash and count:

h <- apply(tbl, 1, digest)                         # one hash per row
tbl <- cbind(tbl, count = as.vector(table(h)[h]))  # attach each row's group size
tbl <- tbl[!duplicated(h), ]                       # keep one row per distinct hash

Here's the result:

     K Q J Z count
[1,] 1 1 0 0     2
[2,] 1 1 1 1     1
[3,] 1 0 1 1     1
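
If you would rather not depend on digest, here is a rough base-R sketch of the same idea. It assumes that pasting each row of counts into a string is a good enough grouping key for this data (it is while every count is a single digit); the names lv, key and res are just placeholders I made up for the example.

```r
data <- c("QK", "KQ", "JKQZ", "KJZ")
lv <- c("K", "Q", "J", "Z")

# One row of letter counts per string, using fixed levels to get the zero cells
tbl <- t(sapply(strsplit(data, ""), function(x) table(factor(x, levels = lv))))

# Use the pasted row as the grouping key in place of a hash
key <- apply(tbl, 1, paste, collapse = "")

# Keep one row per distinct key and attach how often that key occurred
res <- cbind(tbl[!duplicated(key), , drop = FALSE],
             count = as.vector(table(key)[unique(key)]))
res
```

Indexing table(key) by unique(key) keeps the counts in the same order as the de-duplicated rows.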

Upvotes: 1

Frank

Reputation: 66819

Here's a longer variation on David's comment/answer:

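# distinct letters that appear anywhere in the data
vals    <- sort(unique(unlist(strsplit(data, ""))))
# every combination of 1..n letters, in sorted order, to use as factor levels
combos  <- unlist(sapply(seq_along(vals), function(i) combn(vals, i, paste0, collapse = "")))
# sort the letters inside each entry so that "QK" and "KQ" both become "KQ"
newdata <- factor(sapply(strsplit(data, ""), function(x) paste0(sort(x), collapse = "")),
                  levels = combos)
tab <- table(newdata)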
# newdata
#    J    K    Q    Z   JK   JQ   JZ   KQ   KZ   QZ  JKQ  JKZ  JQZ  KQZ JKQZ 
#    0    0    0    0    0    0    0    2    0    0    0    1    0    0    1 
tab[tab>0] # alternately
#   KQ  JKZ JKQZ 
#    2    1    1 

Upvotes: 2
