Reputation: 59
I am confused about how mutate in tidyverse/dplyr works. Below is a reproducible example: one approach uses mutate and the other uses a for loop. I would expect both to give the same result, but they do not, and I have no idea why. Any help would be appreciated.
library(tidyverse)
d <- data.frame(x = c('a,a,b,b,b','a,a','a,b,b,b,c,c,c'))
# Approach 1 (mutate)
d %>%
mutate(y = paste(unique(str_split(x, ',')[[1]]), collapse = ','))
d
# Approach 2 (loop)
for (i in 1:nrow(d))
{
d$y[i] <- paste(unique(str_split(d$x[i], ',')[[1]]), collapse = ',')
}
d
I expect the output to be the same for both approaches, but it is not.
Upvotes: 1
Views: 54
Reputation: 887048
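The issue is that [[1]] extracts only the first element of the list returned by str_split, so unique is applied to that element alone. Inside mutate, x is the whole column, so the resulting single string is recycled to every row. Instead, we need to loop through the list (the str_split output) and process each element: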
library(tidyverse)
d %>%
mutate(y = str_split(x, ',') %>% # output is a list
map_chr(~ unique(.x) %>% # loop with map, get the unique elements
toString)) # paste the strings together
# x y
#1 a,a,b,b,b a, b
#2 a,a a
#3 a,b,b,b,c,c,c a, b, c
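Note that toString joins with a comma and a space, so the result above is not character-for-character identical to the for loop output. As a minimal sketch, assuming the goal is to reproduce the loop's result exactly, paste with collapse = ',' can be used inside map_chr instead (the name res is only illustrative, and stringsAsFactors = FALSE is added so the column is character on older R versions):

```r
library(dplyr)
library(stringr)
library(purrr)

d <- data.frame(x = c('a,a,b,b,b', 'a,a', 'a,b,b,b,c,c,c'),
                stringsAsFactors = FALSE)

res <- d %>%
  mutate(y = str_split(x, ',') %>%                       # list with one vector per row
           map_chr(~ paste(unique(.x), collapse = ',')))  # de-duplicate each vector, join without spaces

res$y
```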
In the for loop this was not a problem, because the splitting was done one element at a time with str_split(d$x[i], ',').
To understand this better, note that str_split (strsplit in base R) is vectorized: it can take multiple strings and split them into a list of vectors whose length equals the length of the initial vector:
str_split(d$x, ',') # list of length 3
#[[1]]
#[1] "a" "a" "b" "b" "b"
#[[2]]
#[1] "a" "a"
#[[3]]
#[1] "a" "b" "b" "b" "c" "c" "c"
Extracting the first element with [[1]] returns only the first row's pieces:
str_split(d$x, ',')[[1]]
#[1] "a" "a" "b" "b" "b"
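To see why every row ends up with the same value in the mutate version, here is a small sketch of what that expression evaluates to when it receives the whole column at once. The names x, first_only and demo are illustrative only:

```r
library(stringr)

x <- c('a,a,b,b,b', 'a,a', 'a,b,b,b,c,c,c')

# Same expression as in the question's mutate call, applied to the full vector
first_only <- paste(unique(str_split(x, ',')[[1]]), collapse = ',')

first_only          # one string, built from the first row only
length(first_only)  # a length-1 result

# A length-1 value is recycled across all rows, as it is inside mutate
demo <- data.frame(x = x, y = first_only)
demo
```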
In the for
loop, we are individually splitting the elements and extract the list (length 1) element
str_split(d$x[1], ',')[[1]]
#[1] "a" "a" "b" "b" "b"
str_split(d$x[2], ',')[[1]]
#[1] "a" "a"
That is why we need to loop over the list and get the unique values from each of its elements.
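As a possible alternative (a sketch, not part of the approach above), dplyr's rowwise() makes mutate evaluate the expression one row at a time. In that case x has length 1 inside mutate, so the original [[1]] expression behaves the same way as in the for loop. The name res is illustrative:

```r
library(dplyr)
library(stringr)

d <- data.frame(x = c('a,a,b,b,b', 'a,a', 'a,b,b,b,c,c,c'),
                stringsAsFactors = FALSE)

res <- d %>%
  rowwise() %>%   # evaluate the following mutate once per row
  mutate(y = paste(unique(str_split(x, ',')[[1]]), collapse = ',')) %>%
  ungroup()       # drop the row-wise grouping afterwards

res$y
```

This keeps the original expression unchanged, but on large data frames row-wise evaluation is typically slower than mapping over the list.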
Upvotes: 1