Reputation: 4949
Hi I would like to split a delimiter into new rows: there are a few similar posts on stackoverflow, however I can't find one that addresses issues with eliminating duplicates. I've tried several ways.
df <- read.table(text= 'sample, GENE1
s1 A,B
s4 B,A,A,C,C'
, header = TRUE, stringsAsFactors = FALSE)
df %>%
mutate(GENE1b = unique(strsplit(as.character(GENE1), ",")) ) %>%
unnest(GENE1b)
code above will only produce
# A tibble: 7 x 3
sample. GENE1 GENE1b
<chr> <chr> <chr>
1 s1 A,B A
2 s1 A,B B
3 s4 B,A,A,C,C B
4 s4 B,A,A,C,C A
5 s4 B,A,A,C,C A
6 s4 B,A,A,C,C C
7 s4 B,A,A,C,C C
which is incorrect. since s4 should only contain new rows for B,A,C and no duplicates. Of course I can always remove the duplicates afterwards but I'm wondering if there is a way to do it in one go. I also tried converting it back with paste ( x, collapse=",") but this also failed.
Upvotes: 2
Views: 392
Reputation: 78927
Add group_by(sample.) %>% distinct()
to your code:
library(dplyr)
library(tidyr)
df %>%
mutate(GENE1b = unique(strsplit(as.character(GENE1), ",")) ) %>%
unnest(GENE1b) %>%
group_by(sample.) %>%
distinct()
sample. GENE1 GENE1b
<chr> <chr> <chr>
1 s1 A,B A
2 s1 A,B B
3 s4 B,A,A,C,C B
4 s4 B,A,A,C,C A
5 s4 B,A,A,C,C C
Upvotes: 2
Reputation: 887088
It may be easier to do this with separate_rows
i.e. after the expansion, apply distinct
on both columns
library(dplyr)
library(tidyr)
df %>%
mutate(GENE1b = GENE1) %>%
separate_rows(GENE1b) %>%
distinct(sample., GENE1b, .keep_all = TRUE)
-output
# A tibble: 5 x 3
sample. GENE1 GENE1b
<chr> <chr> <chr>
1 s1 A,B A
2 s1 A,B B
3 s4 B,A,A,C,C B
4 s4 B,A,A,C,C A
5 s4 B,A,A,C,C C
Regarding the OP's issue with unique
- unique
expects a vector as input whereas strsplit
output is a list
. Therefore, we may need to loop over the list
or use unlist
(but this will not work as unlist
will get all the list
elements to a single vector.
library(dplyr)
library(purrr)
library(tidyr)
df %>%
mutate(GENE1b = map(strsplit(GENE1, ","), unique)) %>%
unnest(GENE1b)
-output
# A tibble: 5 x 3
sample. GENE1 GENE1b
<chr> <chr> <chr>
1 s1 A,B A
2 s1 A,B B
3 s4 B,A,A,C,C B
4 s4 B,A,A,C,C A
5 s4 B,A,A,C,C C
Upvotes: 2