Reputation: 107
I'm working with a dataframe that has two columns containing "words" (letter-digit-digit) as such:
df <- data.frame(A = c("V01, G30, V01, M05, M05, M05, N98", "J56, K45, L78, J56"),
B = c("J45, H67, M05, N54, M05, J76, V01", "S54, F32, K45"))
The strings are not always the same length and they contain different words. Some of the words are similar, some of them are even duplicated. What I would like to obtain is three new columns.
First column gives me the number of times each word appears in column A, second column gives me the number of times each words appears in column B and finally third column gives me the number of times each word appears in the intersection of columns A and B. The output should look like this:
df <- data.frame(A = c("V01, G30, V01, M05, M05, M05, N98", "J56, K45, L78, J56"),
B = c("J45, H67, M05, N54, M05, J76, V01", "S54, F32, K45"),
col1 = c("V01*2, G30*1, M05*3, N98*1", "J56*2, K45*1, L78*1"),
col2 = c("J45*1, H67*1, M05*2, N54*1, J76*1, V01*1", "S54*1, F32*1, K45*1"),
col3 = c("V01*1, M05*2", "K45*1"))
I do not really know where to start with this, so any help would be much appreciated!!!
Upvotes: 1
Views: 161
Reputation: 887911
An option would be
library(tidyverse)
f1 <- function(dat, colNm) {
dat %>%
select(colNm) %>%
rownames_to_column('rn') %>%
separate_rows(!! rlang::sym(colNm)) %>%
add_count(rn, !! rlang::sym(colNm)) %>%
unite(colN, !! rlang::sym(colNm), n, sep="*") %>%
group_by(rn) %>%
summarise_at(1, ~ toString(unique(.))) %>%
select(-rn)
}
df1 <- df %>%
mutate_all(str_extract_all, "\\w+") %>%
pmap_chr(~ keep(.y, .y %in% intersect(.x, .y)) %>%
toString) %>%
mutate(df, C = .)
map_dfc(names(df1), f1, dat = df1) %>%
rename_all(~ str_c("col", 1:3)) %>%
bind_cols(df, .)
# A B col1 col2
#1 V01, G30, V01, M05, M05, M05, N98 J45, H67, M05, N54, M05, J76, V01 V01*2, G30*1, M05*3, N98*1 J45*1, H67*1, M05*2, N54*1, J76*1, V01*1
#2 J56, K45, L78, J56 S54, F32, K45 J56*2, K45*1, L78*1 S54*1, F32*1, K45*1
# col3
#1 M05*2, V01*1
#2 K45*1
Upvotes: 3
Reputation: 11981
you can use a combination of sapply
and strsplit
like this:
myfun <- function(x){
x2 <- sort(table(x), decreasing = T)
paste(names(x2), x2, sep = "*", collapse = ", ")
}
df$col1 <- sapply(strsplit(df$A, split = ", "), myfun)
df$col2 <- sapply(strsplit(df$B, split = ", "), myfun)
df$col3 <- sapply(strsplit(paste(df$A, df$B, sep = ", "), split = ", "), myfun)
Note however, you should convert your columns to character
: you can do so by setting stringsAsFactors = F
inside data.frame
:
df <- data.frame(A = c("V01, G30, V01, M05, M05, M05, N98", "J56, K45, L78, J56"),
B = c("J45, H67, M05, N54, M05, J76, V01", "S54, F32, K45"),
stringsAsFactors = F)
Upvotes: 2
Reputation: 750
I know it is not the exact same structure of your output, but I would reconsider reshaping it for data analysis purposes. It is easier to see what you want ("How many times each word appears in column A?") if each unique word is a row in another dataframe.
library(dplyr)
df <- tibble(A = c("V01, G30, V01, M05, M05, M05, N98",
"J56, K45, L78, J56"),
B = c("J45, H67, M05, N54, M05, J76, V01",
"S54, F32, K45"))
words_A <- df$A %>%
strsplit(", ") %>%
unlist()
words_B <- df$B %>%
strsplit(", ") %>%
unlist()
tibble(word = unique(c(words_A, words_B))) %>%
group_by(word) %>%
summarise(n_A = sum(word == words_A),
n_B = sum(word == words_B),
n_AB = sum(word == intersect(words_A, words_B)),
n_AB_2 = min(n_A, n_B))
# A tibble: 13 x 5
word n_A n_B n_AB n_AB_2
<chr> <int> <int> <int> <int>
1 F32 0 1 0 0
2 G30 1 0 0 0
3 H67 0 1 0 0
4 J45 0 1 0 0
5 J56 2 0 0 0
6 J76 0 1 0 0
7 K45 1 1 1 1
8 L78 1 0 0 0
9 M05 3 2 1 2
10 N54 0 1 0 0
11 N98 1 0 0 0
12 S54 0 1 0 0
13 V01 2 1 1 1
Please, note that intersect()
(as in mathematical definition) removes duplicates. As you want to consider duplicates, I also defined the n_AB_2 = min(n_A, n_B)
column, which does the trick.
Upvotes: 0