Ahdee
Ahdee

Reputation: 4949

Split delimiter ( but only unique values ) into new rows in one go?

Hi I would like to split a delimiter into new rows: there are a few similar posts on stackoverflow, however I can't find one that addresses issues with eliminating duplicates. I've tried several ways.

df <- read.table(text= 'sample, GENE1
s1 A,B
s4 B,A,A,C,C'
, header = TRUE, stringsAsFactors = FALSE)

df %>%
  mutate(GENE1b =  unique(strsplit(as.character(GENE1), ",")) )  %>%
  unnest(GENE1b)    

code above will only produce

# A tibble: 7 x 3
  sample. GENE1     GENE1b
  <chr>   <chr>     <chr> 
1 s1      A,B       A     
2 s1      A,B       B     
3 s4      B,A,A,C,C B     
4 s4      B,A,A,C,C A     
5 s4      B,A,A,C,C A     
6 s4      B,A,A,C,C C     
7 s4      B,A,A,C,C C    

which is incorrect. since s4 should only contain new rows for B,A,C and no duplicates. Of course I can always remove the duplicates afterwards but I'm wondering if there is a way to do it in one go. I also tried converting it back with paste ( x, collapse=",") but this also failed.

Upvotes: 2

Views: 392

Answers (2)

TarJae
TarJae

Reputation: 78927

Add group_by(sample.) %>% distinct() to your code:

library(dplyr)
library(tidyr)

    df %>%
        mutate(GENE1b =  unique(strsplit(as.character(GENE1), ",")) )  %>%
        unnest(GENE1b) %>% 
        group_by(sample.) %>% 
        distinct()    
  sample. GENE1     GENE1b
  <chr>   <chr>     <chr> 
1 s1      A,B       A     
2 s1      A,B       B     
3 s4      B,A,A,C,C B     
4 s4      B,A,A,C,C A     
5 s4      B,A,A,C,C C   

Upvotes: 2

akrun
akrun

Reputation: 887088

It may be easier to do this with separate_rows i.e. after the expansion, apply distinct on both columns

library(dplyr)
library(tidyr)
df %>% 
    mutate(GENE1b = GENE1) %>%
    separate_rows(GENE1b) %>% 
    distinct(sample., GENE1b, .keep_all = TRUE)

-output

# A tibble: 5 x 3
  sample. GENE1     GENE1b
  <chr>   <chr>     <chr> 
1 s1      A,B       A     
2 s1      A,B       B     
3 s4      B,A,A,C,C B     
4 s4      B,A,A,C,C A     
5 s4      B,A,A,C,C C      

Regarding the OP's issue with unique - unique expects a vector as input whereas strsplit output is a list. Therefore, we may need to loop over the list or use unlist (but this will not work as unlist will get all the list elements to a single vector.

library(dplyr)
library(purrr)
library(tidyr)
df %>%
    mutate(GENE1b = map(strsplit(GENE1, ","), unique)) %>%
    unnest(GENE1b)

-output

# A tibble: 5 x 3
  sample. GENE1     GENE1b
  <chr>   <chr>     <chr> 
1 s1      A,B       A     
2 s1      A,B       B     
3 s4      B,A,A,C,C B     
4 s4      B,A,A,C,C A     
5 s4      B,A,A,C,C C   

Upvotes: 2

Related Questions