aev1tas

Reputation: 21

Need to find most common combination of letters

Let's say for simplicity that I have 10 rows of 5 characters, where each character can be A-Z.

E.g.:

KJGXI
GDGQT
JZKDC
YOTQD
SSDIQ
PLUWC
TORHC
PFJSQ
IIZMO
BRPOJ
WLMDX
AZDIJ
ARNUA
JEXGA
VFPIP
GXOXM
VIZEM
TFVQJ
OFNOG
QFNJR
ZGUBZ
CCTMB
HZPGV
ORQTJ

I want to know which 3-letter combination is most common. However, the letters of a combination do not need to be in order, nor next to each other. E.g.:

ABCXY
CQDBA

= ABC

I could probably brute-force it with endless loops but I was wondering if there was a better way of doing it!

Upvotes: 1

Views: 204

Answers (2)

etienne

Reputation: 3678

Here is a solution:

x <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC", "PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ",
       "ARNUA", "JEXGA", "VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ", "CCTMB", "HZPGV", "ORQTJ")

temp <- do.call(cbind, lapply(strsplit(x, ""), combn, m = 3))

temp <- apply(temp, 2, sort)
temp <- apply(temp, 2, paste0, collapse = "")

sort(table(temp), decreasing = TRUE)

This returns the number of times each combination appears. You can then use names(which.max(sort(table(temp), decreasing = TRUE))) to get the most common combination (in this case, "FJQ").

In this case, two combinations are tied at 3 occurrences each. To get all of the tied combinations, you can do:

result <- sort(table(temp), decreasing = TRUE)
names(which(result == max(result)))
# [1] "FJQ" "IMZ"

which returns both of the combinations that appear most often.
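
One caveat, offered as a side note rather than as part of the approach above: combn treats repeated letters within a row as distinct, so the two I's in "IIZMO" make "IMZ" count twice for that single row. If each combination should count at most once per row, a minimal sketch could look like this. The helper name row_combos is made up for illustration, and combinations that themselves contain a repeated letter (such as "IIZ") are kept as they are, which is an assumption about what the question wants.

```r
# Same data as in the question.
x <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC", "PFJSQ",
       "IIZMO", "BRPOJ", "WLMDX", "AZDIJ", "ARNUA", "JEXGA", "VFPIP", "GXOXM",
       "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ", "CCTMB", "HZPGV", "ORQTJ")

# Hypothetical helper: sorted 3-letter combinations of one string,
# with repeats within that string removed.
row_combos <- function(s) {
  letters_in_row <- strsplit(s, "")[[1]]
  combos <- combn(letters_in_row, 3,
                  FUN = function(v) paste0(sort(v), collapse = ""))
  unique(combos)
}

# Count in how many rows each combination occurs.
counts <- table(unlist(lapply(x, row_combos)))
counts[c("FJQ", "IMZ")]
# FJQ occurs in 3 rows, IMZ in 2
```

With this per-row counting, "IMZ" drops to 2 because it only occurs in "IIZMO" and "VIZEM", while "FJQ" stays at 3.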


The code works as follows:

  • split each element of x into its five letters, then generate every possible combination of 3 elements from those 5 letters
  • sort the letters of each combination alphabetically
  • paste the 3 letters together
  • count how often each combination occurs, and sort the result
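
To make the steps above concrete, here is a small sketch that runs them on the two example strings from the question ("ABCXY" and "CQDBA") and inspects the intermediate results. The variable names are only illustrative.

```r
# The two example strings from the question.
s <- c("ABCXY", "CQDBA")

# Step 1: split into letters, then take every 3-letter combination.
letters_list <- strsplit(s, "")
combos <- lapply(letters_list, combn, m = 3)
dim(combos[[1]])
# 3 rows (letters per combination) by 10 columns (choose(5, 3) combinations)

# Steps 2 and 3: sort each column alphabetically and paste it into one string,
# so that "CBA" and "ABC" end up as the same key.
keys <- lapply(combos, function(m) {
  apply(m, 2, function(col) paste0(sort(col), collapse = ""))
})

# The only key the two strings share:
intersect(keys[[1]], keys[[2]])
# "ABC"
```

Sorting before pasting is what makes the order of the letters irrelevant, which is why the two strings agree on "ABC".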

Upvotes: 2

Richard Telford

Reputation: 9923

I would split each string into letters, sort them, then use combn to get all combinations. Use paste0 to collapse these back into strings and count.

txt <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC", 
     "PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ", "ARNUA", "JEXGA", 
     "VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ", 
     "CCTMB", "HZPGV", "ORQTJ")
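# split each string into single letters
txt2 <- strsplit(txt, split = "")

# sort the letters so every combination comes out in alphabetical order
txt2 <- lapply(txt2, sort)
# all 3-letter combinations per string (one 3 x 10 matrix each)
txt3 <- lapply(txt2, combn, m = 3)

# collapse each column back into a 3-letter string, then count
txt4 <- lapply(txt3, function(x){apply(x, 2, paste0, collapse = "")})
table(unlist(txt4))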

Several steps here could be combined.
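
As one possible way of combining them, the sketch below passes paste0 directly to combn through its FUN argument, so the sort, combine and collapse steps happen in a single lapply call. This is just one arrangement, not the only one.

```r
txt <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC",
         "PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ", "ARNUA", "JEXGA",
         "VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ",
         "CCTMB", "HZPGV", "ORQTJ")

# Sort the letters first; combn keeps them in that order, so every
# combination is already alphabetical when paste0 collapses it.
counts <- table(unlist(lapply(
  strsplit(txt, split = ""),
  function(l) combn(sort(l), 3, paste0, collapse = "")
)))

# Most frequent combinations first.
head(sort(counts, decreasing = TRUE))
```

Extra arguments to combn (here collapse = "") are forwarded to FUN, which is what lets paste0 return one string per combination.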

Upvotes: 1
