Compare strings from two different columns where each cell contains duplicated values and count them in R

I'm working with a dataframe that has two columns containing "words" (letter-digit-digit) as such:

df <- data.frame(A = c("V01, G30, V01, M05, M05, M05, N98", "J56, K45, L78, J56"),
                 B = c("J45, H67, M05, N54, M05, J76, V01", "S54, F32, K45"))

The strings are not always the same length and they contain different words. Some of the words are similar, some of them are even duplicated. What I would like to obtain is three new columns.

First column gives me the number of times each word appears in column A, second column gives me the number of times each words appears in column B and finally third column gives me the number of times each word appears in the intersection of columns A and B. The output should look like this:

df <- data.frame(A = c("V01, G30, V01, M05, M05, M05, N98", "J56, K45, L78, J56"),
                 B = c("J45, H67, M05, N54, M05, J76, V01", "S54, F32, K45"), 
                 col1 = c("V01*2, G30*1, M05*3, N98*1", "J56*2, K45*1, L78*1"),
                 col2 = c("J45*1, H67*1, M05*2, N54*1, J76*1, V01*1", "S54*1, F32*1, K45*1"), 
                 col3 = c("V01*1, M05*2", "K45*1"))

I do not really know where to start with this, so any help would be much appreciated!!!

Upvotes: 1

Answers (3)

akrun

Reputation: 887911

An option would be

library(tidyverse)
f1 <-   function(dat, colNm) {
          dat %>%
              select(colNm) %>%
              rownames_to_column('rn') %>% 
              separate_rows(!! rlang::sym(colNm)) %>% 
              add_count(rn, !! rlang::sym(colNm)) %>%
              unite(colN, !! rlang::sym(colNm), n, sep="*") %>% 
              group_by(rn) %>% 
              summarise_at(1, ~ toString(unique(.)))  %>%
              select(-rn)


}
df1 <- df %>%  
          mutate_all(str_extract_all, "\\w+") %>% 
          pmap_chr(~ keep(.y,  .y %in% intersect(.x, .y)) %>% 
                    toString) %>%
          mutate(df, C = .)
map_dfc(names(df1), f1, dat = df1) %>%
    rename_all(~ str_c("col", 1:3)) %>%
    bind_cols(df, .)
#                               A                                 B                       col1                                     col2
#1 V01, G30, V01, M05, M05, M05, N98 J45, H67, M05, N54, M05, J76, V01 V01*2, G30*1, M05*3, N98*1 J45*1, H67*1, M05*2, N54*1, J76*1, V01*1
#2                J56, K45, L78, J56                     S54, F32, K45        J56*2, K45*1, L78*1                      S54*1, F32*1, K45*1
#          col3
#1 M05*2, V01*1
#2        K45*1

Upvotes: 3

Cettt

Reputation: 11981

you can use a combination of sapply and strsplit like this:

myfun <- function(x){
  x2 <- sort(table(x), decreasing = T)
  paste(names(x2), x2, sep = "*", collapse = ", ")
}

df$col1 <- sapply(strsplit(df$A, split = ", "), myfun)
df$col2 <- sapply(strsplit(df$B, split = ", "), myfun)
df$col3 <- sapply(strsplit(paste(df$A, df$B, sep = ", "), split = ", "), myfun)

Note however, you should convert your columns to character: you can do so by setting stringsAsFactors = F inside data.frame:

 df <- data.frame(A = c("V01, G30, V01, M05, M05, M05, N98", "J56, K45, L78, J56"),     
                  B = c("J45, H67, M05, N54, M05, J76, V01", "S54, F32, K45"), 
                  stringsAsFactors = F)

Upvotes: 2

Gabriel M. Silva

Reputation: 750

I know it is not the exact same structure of your output, but I would reconsider reshaping it for data analysis purposes. It is easier to see what you want ("How many times each word appears in column A?") if each unique word is a row in another dataframe.

library(dplyr)

df <- tibble(A = c("V01, G30, V01, M05, M05, M05, N98", 
                   "J56, K45, L78, J56"),
             B = c("J45, H67, M05, N54, M05, J76, V01", 
                   "S54, F32, K45"))

words_A <- df$A %>% 
  strsplit(", ") %>%
  unlist()

words_B <- df$B %>% 
  strsplit(", ") %>%
  unlist()

tibble(word = unique(c(words_A, words_B))) %>%
  group_by(word) %>%
  summarise(n_A = sum(word == words_A),
            n_B = sum(word == words_B),
            n_AB = sum(word == intersect(words_A, words_B)),
            n_AB_2 = min(n_A, n_B))

# A tibble: 13 x 5
   word    n_A   n_B  n_AB n_AB_2
   <chr> <int> <int> <int>  <int>
 1 F32       0     1     0      0
 2 G30       1     0     0      0
 3 H67       0     1     0      0
 4 J45       0     1     0      0
 5 J56       2     0     0      0
 6 J76       0     1     0      0
 7 K45       1     1     1      1
 8 L78       1     0     0      0
 9 M05       3     2     1      2
10 N54       0     1     0      0
11 N98       1     0     0      0
12 S54       0     1     0      0
13 V01       2     1     1      1

Please, note that intersect() (as in mathematical definition) removes duplicates. As you want to consider duplicates, I also defined the n_AB_2 = min(n_A, n_B) column, which does the trick.

Upvotes: 0

Compare strings from two different columns where each cell contains duplicated values and count them in R

Answers (3)

Related Questions