Reputation: 173

Count number of occurrence, with order of string counted only 1x

Revised question with more accurate dataset example

I have several different lists, each containing many characters. Here is a very short example:

List1 <- "A + B + C + D + E:F + F:E"

List2 <- "A + B + C + E:F + F:E + G:H + H:G"

List3 <- "J + K + L + L:H + L:H1"

I'm trying to find how often each item occurs across all of these lists, but duplicates of some items are causing problems.

Using a lot of loops, X %in% Y, and strsplit (splitting before and after ":"), I've gotten this:

sig_var8
    var count
1     0     0
2     A     2
3     B     2
4     C     2
5     D     1
6   E:F     2
7   F:E     2
8   G:H     1
9   H:G     1
10    J     1
11    K     1
12    L     1
13  L:H     1
14 L:H1     1

What I would like is this:

sig_var8
    var count
1     0     0
2     A     2
3     B     2
4     C     2
5     D     1
6   E:F     2
7   G:H     1
8     J     1
9     K     1
10    L     1
11  L:H     1
12 L:H1     1

Note: in list 1, E:F and F:E are considered the same item and should be counted only once. The same applies in list 2, where G:H == H:G is counted only once. grep is not a good fit here because L:H and L:H1 in list 3 are not the same; they need to be counted separately (hence the %in%).
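The matching rule described above can be sketched as an order-insensitive key: split a term on ":", sort the parts, and paste them back together. This is only an illustration of the idea; `canonical_key` is a hypothetical helper name, not something from the code in this question.

```r
# Hypothetical helper: order-insensitive key for one term, e.g. "F:E" -> "E:F"
canonical_key <- function(term) {
  parts <- strsplit(term, ":", fixed = TRUE)[[1]]
  paste(sort(parts), collapse = ":")
}

terms <- c("E:F", "F:E", "L:H", "L:H1")
keys <- vapply(terms, canonical_key, character(1), USE.NAMES = FALSE)

keys[1] == keys[2]   # E:F and F:E share a key
keys[3] == keys[4]   # L:H and L:H1 do not, because the comparison is exact
```

Because the keys are compared with exact equality rather than pattern matching, L:H is never confused with L:H1.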

Here's the code that I've worked on:

sig_var8 <- data.frame(matrix(data = 0, nrow = 1, ncol = 2))
colnames(sig_var8) <- c("var", "count")
sig_var8[, 1] <- as.character(sig_var8[, 1])
sig_var8[, 2] <- as.numeric(sig_var8[, 2])

for (list in 1:3) {
  temp_list <- get(paste0("List", list)) # get the equation above
  assign(paste0("List", list, "a"), gsub(" ", "", temp_list)) # remove all spaces in the sentence
  assign(paste0("List", list, "a_split"), strsplit(get(paste0("List", list, "a")), "[+]")) # split where "+" are
  temp_listA <- get(paste0("List", list, "a_split"))[[1]]
  for (item in 1:length(temp_listA)) {
    if (isTRUE(temp_listA[item] %in% sig_var8[, 1])) {
      row_n <- which(sig_var8[, 1] == temp_listA[item])
      sig_var8[row_n, 2] <- sig_var8[row_n, 2] + 1
    }
    if (isFALSE(temp_listA[item] %in% sig_var8[, 1])) {
      row_n <- nrow(sig_var8)
      sig_var8[row_n + 1, 1] <- temp_listA[item]
      sig_var8[row_n + 1, 2] <- 1
    }
  }
}

Upvotes: 2

Answers (3)

Rui Barradas

Reputation: 76450

Maybe something like the following does what you want.

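# Collect every object whose name starts with "List" into a named list.
# Assumes each one is a character vector of terms, e.g. c("A", "E:F", "F:E").
Lst <- mget(ls(pattern = "^List"))

Lst <- lapply(Lst, function(x) {
  # Split each term on ":" and sort the parts, so "F:E" becomes "E:F"
  L <- strsplit(x, ":")
  res <- sapply(L, function(y){
    paste(sort(y), collapse = ":")
  })
  # Keep each normalized term once per list
  unique(res)
})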

table(unlist(Lst))
#
#   A    B    C    D  E:F  G:H  H:L H1:L    J    K    L 
#   2    2    2    1    2    1    1    1    1    1    1
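Two things to watch for with the output above: sorting the parts relabels L:H as H:L, and the revised question stores each list as a single "+"-separated string rather than a vector. A minimal sketch that starts from the question's strings and keeps the first spelling seen in each list could look like this (`unique_terms` is a hypothetical helper name):

```r
List1 <- "A + B + C + D + E:F + F:E"
List2 <- "A + B + C + E:F + F:E + G:H + H:G"
List3 <- "J + K + L + L:H + L:H1"

# Hypothetical helper: split one "+"-separated string into terms and drop
# terms that repeat an earlier one with the parts in a different order.
unique_terms <- function(formula_string) {
  terms <- trimws(strsplit(formula_string, "+", fixed = TRUE)[[1]])
  sorted_parts <- lapply(strsplit(terms, ":", fixed = TRUE), sort)
  terms[!duplicated(sorted_parts)]
}

all_terms <- unlist(lapply(list(List1, List2, List3), unique_terms))
counts <- table(all_terms)
counts
```

On the three example strings this counts A, B, C and E:F twice and every other term once, with L:H and L:H1 keeping their original spelling.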

Upvotes: 3

Andrew

Reputation: 5138

I am not 100% sure this is what you are looking for, but if it is I will annotate it.

List1 <- c("A", "B", "C", "D", "E:F", "F:E")
List2 <- c("A", "B", "C", "E:F", "F:E", "G:H", "H:G")
List3 <- c("J", "K", "L", "L:H", "L:H1")

Lst <- list(List1, List2, List3)

# TRUE for the first occurrence of each term, ignoring the order around ":"
keep_me <- lapply(Lst, function(x) !duplicated(lapply(strsplit(x, ":", fixed = TRUE), sort)))
Lst_cleaned <- unlist(Map(`[`, Lst, keep_me))
table(Lst_cleaned)
Lst_cleaned
   A    B    C    D  E:F  G:H    J    K    L  L:H L:H1 
   2    2    2    1    2    1    1    1    1    1    1

Edit: added explanation below. Let me know if anything is still unclear or if you run into more issues. I start with List1 to demonstrate what lapply does for each list element. As a side note, breaking it down made me realize you do not need which if you do not want it: you can use the logical vector in Map to subset the elements of Lst.

# Splitting the string on the colon and sorting the elements
lapply(strsplit(List1, ":", fixed = TRUE), sort)
[[1]]
[1] "A"

[[2]]
[1] "B"

[[3]]
[1] "C"

[[4]]
[1] "D"

[[5]]
[1] "E" "F"

[[6]]
[1] "E" "F"

# Logical vector marking the elements that are NOT duplicated
!duplicated(lapply(strsplit(List1, ":", fixed = TRUE), sort))
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

# which gives the indices of the TRUEs
which(!duplicated(lapply(strsplit(List1, ":", fixed = TRUE), sort)))
[1] 1 2 3 4 5

# Now, all together: lapply applies the above logic to
# each element of Lst and returns a list of the indices that are not
# duplicates for each vector
lapply(Lst, function(x) which(!duplicated(lapply(strsplit(x, ":", fixed = TRUE), sort))))
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] 1 2 3 4 6

[[3]]
[1] 1 2 3 4 5

keep_me <- lapply(Lst, function(x) which(!duplicated(lapply(strsplit(x, ":", fixed = TRUE), sort))))

# Map subsets (`[`) Lst by the indices in keep_me, and unlist  
# flattens the list (i.e., unlist makes it a vector)
Map(`[`, Lst, keep_me)
[[1]]
[1] "A"   "B"   "C"   "D"   "E:F"

[[2]]
[1] "A"   "B"   "C"   "E:F" "G:H"

[[3]]
[1] "J"    "K"    "L"    "L:H"  "L:H1"

unlist(Map(`[`, Lst, keep_me))
 [1] "A"    "B"    "C"    "D"    "E:F"  "A"    "B"    "C"    "E:F"  "G:H"  "J"    "K"    "L"    "L:H"  "L:H1"
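One caveat with keeping the first spelling per vector: if one vector contains only E:F and another only F:E, table would count them under two different labels. A possible way around that, sketched here on made-up data, is to count by the sorted key and then label each key with the first spelling seen anywhere in Lst (`make_key` is a hypothetical helper name):

```r
# Illustrative data: the second vector only has the reversed spelling
Lst <- list(
  c("A", "E:F", "F:E"),
  c("A", "F:E")
)

# Order-insensitive key for every term, e.g. "F:E" -> "E:F"
make_key <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(parts) paste(sort(parts), collapse = ":"),
         character(1))
}

# Each key at most once per vector, then count keys across vectors
keys_per_list <- lapply(Lst, function(x) unique(make_key(x)))
key_counts <- table(unlist(keys_per_list))

# Label each key with the first spelling seen anywhere in Lst
all_terms <- unlist(Lst)
first_spelling <- all_terms[match(names(key_counts), make_key(all_terms))]
counts <- setNames(as.vector(key_counts), first_spelling)
counts
```

Here both A and E:F end up with a count of 2, even though the second vector never spells the pair as E:F.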

Upvotes: 1

nsinghphd

Reputation: 2022

Based on @Rui's answer, I think this will do what you want.

List1 <- c("A", "B", "C", "D", "E:F", "F:E")
List2 <- c("A", "B", "C", "E:F", "F:E", "G:H", "H:G")
List3 <- c("J", "K", "L", "L:H", "L:H1")

# make list of all objects starting with List
Lst <- mget(ls(pattern = "^List"))

# function to split, sort, and stitch the duplicates
split.sort <- function(x) {
  ifelse(length(x) > 1, paste0(sort(x), collapse = ":"), x)
}

# apply function to each of the Lst lists and remove duplicates
Lst <- lapply(Lst, function(y) unique(sapply(strsplit(y, ":"), split.sort)))

# get frequency
table(unlist(Lst))
#> 
#>    A    B    C    D  E:F  G:H  H:L H1:L    J    K    L 
#>    2    2    2    1    2    1    1    1    1    1    1

Created on 2019-04-17 by the reprex package (v0.2.1)
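If a data frame with var and count columns is needed, as in the question, the table can be converted with as.data.frame. A small sketch on illustrative data (the `terms` vector below stands in for the de-duplicated, unlisted terms):

```r
# Example terms after within-list de-duplication (illustrative data only)
terms <- c("A", "B", "E:F", "A", "E:F", "G:H")

counts <- table(terms)

# as.data.frame() on a table gives one row per distinct term;
# rename the columns to match the layout used in the question
sig_var8 <- as.data.frame(counts, stringsAsFactors = FALSE)
names(sig_var8) <- c("var", "count")
sig_var8
```

Unlike the loop in the question, this does not produce the placeholder first row with var 0 and count 0.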

Upvotes: 1
