user237554

Reputation: 89

Counting delimited unique strings in a data frame in R

I have a data frame as follows:

a <- c(1, 2, 3, 4)
b <- c("AA; AA; BC", "BC; DE", "AA; BC; BC", "DE; DE")
df <- data.frame(a,b)

I want to count the number of unique two-letter combinations in each string in column b. So the correct answer would be 2, 2, 2, 1.

If I create a vector outside of the df

test <- c("AA", "AA", "BC")

then

library(stringi)
y <- length(stri_unique(test))

y correctly returns 2. But if I try to implement that in the df:

df <- mutate(df, new_column = length(stri_unique(df$b)))

It returns 1024 for every row, which is definitely not right; the right answer would be 2, 2, 2, 1. I'm trying to understand why it breaks like this. I have tried specifying sep = ";", but then I just get an error that 2 arguments were passed to length, which takes only 1. Any advice appreciated.

Upvotes: 1

Views: 177

Answers (3)

Agaz Wani

Reputation: 5694

Or by using Base R

df$Unq_count <- unlist(lapply(strsplit(df$b, ";\\s"), function(x) length(unique(x))))

  a          b Unq_count
1 1 AA; AA; BC         2
2 2     BC; DE         2
3 3 AA; BC; BC         2
4 4     DE; DE         1
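
For anyone following along, the one-liner above can be unpacked into its intermediate steps. This is only an illustrative sketch: the names `parts` and `counts` are made up here, `vapply` is swapped in for `unlist(lapply(...))` as a type-checked alternative, and the pattern `";\\s*"` is assumed so that a missing space after the semicolon is also tolerated.

```r
# Sample data from the question. stringsAsFactors = FALSE keeps b as character
# (on R < 4.0, data.frame() converts strings to factors by default, and
# strsplit() does not accept factors).
df <- data.frame(
  a = c(1, 2, 3, 4),
  b = c("AA; AA; BC", "BC; DE", "AA; BC; BC", "DE; DE"),
  stringsAsFactors = FALSE
)

# Step 1: split each string on ";" plus any following whitespace.
# The result is a list with one character vector per row.
parts <- strsplit(df$b, ";\\s*")

# Step 2: count the distinct tokens within each list element.
counts <- vapply(parts, function(x) length(unique(x)), integer(1))
counts
# [1] 2 2 2 1
```

Because the counting happens per list element, each row gets its own result rather than one value for the whole column.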

Upvotes: 1

ThomasIsCoding

Reputation: 102760

A data.table option using strsplit + uniqueN

> library(data.table)
> setDT(df)[, uniqCnt := sapply(strsplit(b, ";\\s"), uniqueN)][]
   a          b uniqCnt
1: 1 AA; AA; BC       2
2: 2     BC; DE       2
3: 3 AA; BC; BC       2
4: 4     DE; DE       1

Upvotes: 0

akrun

Reputation: 887941

We can split each string at the delimiter, apply stri_unique to every element of the resulting list, and then take the lengths of the list elements

library(dplyr)
library(purrr)  
library(stringi)  
df %>% 
    mutate(new_column = lengths(map(strsplit(b, ";\\s*"), stri_unique)))

-output

# a          b new_column
#1 1 AA; AA; BC          2
#2 2     BC; DE          2
#3 3 AA; BC; BC          2
#4 4     DE; DE          1
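
As for why the original attempt gives the same number on every row, a rough sketch of what is likely happening: `stri_unique(df$b)` sees the whole column at once and compares entire strings, so `length()` returns a single value that `mutate()` then recycles down the column. The 1024 in the question is presumably the number of distinct strings in the full data; that is an assumption, since only four rows are shown. The variable names `whole_column` and `per_row` are illustrative.

```r
library(stringi)

b <- c("AA; AA; BC", "BC; DE", "AA; BC; BC", "DE; DE")

# Applied to the whole column, stri_unique() de-duplicates entire strings,
# so length() yields one number for the column, not one per row.
whole_column <- length(stri_unique(b))
whole_column
# [1] 4   (all four sample strings differ)

# Splitting first gives one character vector per row; lengths() then
# returns one count per row.
per_row <- lengths(lapply(strsplit(b, ";\\s*"), stri_unique))
per_row
# [1] 2 2 2 1
```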

Upvotes: 2
