Reputation: 89
I have a data frame as follows:
a <- c(1, 2, 3, 4)
b <- c("AA; AA; BC", "BC; DE", "AA; BC; BC", "DE; DE")
df <- data.frame(a,b)
I want to count the number of unique two-letter combinations in each string in column b. So the correct answer would be 2, 2, 2, 1.
If I create a vector outside of the df
test <- c("AA", "AA", "BC")
then
y <- length(stri_unique(test))
y correctly returns 2. But if I try to implement that in the df:
df <- mutate(df, new_column = length(stri_unique(df$b)))
It returns an integer of 1024 for every row, which is definitely not right; the expected result is 2, 2, 2, 1. I'm trying to understand why it breaks like this. I have tried specifying sep = ";", but then I just get an error that 2 arguments were passed to length, which takes only one. Any advice is appreciated.
Upvotes: 1
Views: 177
Reputation: 5694
Or by using base R:
df$Unq_count <- unlist(lapply(strsplit(df$b, ";\\s"), function(x) length(unique(x))))
  a          b Unq_count
1 1 AA; AA; BC         2
2 2     BC; DE         2
3 3 AA; BC; BC         2
4 4     DE; DE         1
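The one-liner above can be unpacked step by step. This is only a sketch on the question's sample data: the intermediate names parts and counts are made up for illustration, and the as.character() call is an added assumption that guards against b being a factor (the data.frame() default in R versions before 4.0).

```r
# Sample data from the question
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c("AA; AA; BC", "BC; DE", "AA; BC; BC", "DE; DE"))

# Step 1: split each string on ";" followed by one whitespace character.
# The result is a list with one character vector per row.
parts <- strsplit(as.character(df$b), ";\\s")

# Step 2: count the distinct codes within each list element.
# vapply() replaces unlist(lapply()) here so the result type is checked.
counts <- vapply(parts, function(x) length(unique(x)), integer(1))

# Step 3: attach the per-row counts as a new column.
df$Unq_count <- counts
print(df)
```

vapply() is a matter of taste here; unlist(lapply(...)) gives the same numbers on this data.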
Upvotes: 1
Reputation: 102760
A data.table option using strsplit + uniqueN
> library(data.table)
> setDT(df)[, uniqCnt := sapply(strsplit(b, ";\\s"), uniqueN)][]
   a          b uniqCnt
1: 1 AA; AA; BC       2
2: 2     BC; DE       2
3: 3 AA; BC; BC       2
4: 4     DE; DE       1
Upvotes: 0
Reputation: 887941
We can split each string at the delimiter, apply stri_unique to each element of the resulting list, and then get the lengths of the list elements:
library(dplyr)
library(purrr)
library(stringi)
df %>%
mutate(new_column = lengths(map(strsplit(b, ";\\s*"), stri_unique)))
-output
#  a          b new_column
#1 1 AA; AA; BC          2
#2 2     BC; DE          2
#3 3 AA; BC; BC          2
#4 4     DE; DE          1
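As for why the original attempt gives the same number in every row: length(stri_unique(df$b)) is evaluated once on the entire column, so it counts distinct whole strings, and mutate() recycles that single value down the column. The sketch below reproduces the effect in base R, with unique() standing in for stri_unique() (a substitution made here only to avoid extra packages). The names whole_column and per_row are illustrative. The 1024 in the question presumably comes from the asker's full data set rather than the four-row sample.

```r
# Sample data from the question
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c("AA; AA; BC", "BC; DE", "AA; BC; BC", "DE; DE"),
                 stringsAsFactors = FALSE)

# Whole-column count: a single number for the entire vector of strings.
# Assigning this to a column repeats the same value in every row.
whole_column <- length(unique(df$b))

# Per-row count: split first, then count distinct codes in each list element.
per_row <- lengths(lapply(strsplit(df$b, ";\\s*"), unique))

print(whole_column)
print(per_row)
```

The key difference is that the split has to happen before counting, and the count has to be taken per list element rather than over the whole vector.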
Upvotes: 2