dentist_inedible

Reputation: 321

dplyr: Collapse rows that may not be present

I am working with biological sequence data in GTF format. Here is a simplified example of the format:

start   stop   type         name 
1       90     exon         transcript_1_exon_1
12      15     start_codon  transcript_1_exon_1
100     160    exon         transcript_1_exon_2
190     250    exon         transcript_1_exon_3
217     220    stop_codon   transcript_1_exon_3

I am trying to convert exons to their protein sequences. However, some parts of the exons are not protein-coding. This is indicated by the presence of a row with the type field set to start_codon or stop_codon.

I would like to move the start of each start_codon row and the stop of each stop_codon row into their own columns on the matching exon row, when present, as follows:

start   stop  type         name                 start_codon  stop_codon
1       90    exon         transcript_1_exon_1  12           NA
100     160   exon         transcript_1_exon_2  NA           NA
190     250   exon         transcript_1_exon_3  NA           220

However, I can't figure out how to do this in R. The closest I've come using dplyr is:

gtf3 <- gtf2 %>% group_by(feature_name) %>% summarise(
  start_codon = ifelse(sum(type == "start_codon") != 0, start[type == "start_codon"], NA),
  stop_codon = ifelse(sum(type == "stop_codon") != 0, stop[type == "stop_codon"], NA))

but this gives me the following error: Evaluation error: object of type 'closure' is not subsettable.

How can I move the start and end of the start/stop codons, respectively, into their own columns when they are present?

Upvotes: 0

Views: 102

Answers (1)

moodymudskipper

Reputation: 47320

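Here's one way to do it, assuming df1 holds the example table from the question: keep the exon rows, then left-join the start_codon and stop_codon rows by name.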

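library(dplyr)

df1 %>%
  filter(type == "exon") %>%
  # attach the start of the start_codon row, if any, as start_codon
  left_join(df1 %>%
              filter(type == "start_codon") %>%
              select(-type, -stop),
            by = "name", suffix = c("", "_codon")) %>%
  # attach the stop of the stop_codon row, if any, as stop_codon
  left_join(df1 %>%
              filter(type == "stop_codon") %>%
              select(-type, -start),
            by = "name", suffix = c("", "_codon"))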

#   start stop type                name start_codon stop_codon
# 1     1   90 exon transcript_1_exon_1          12         NA
# 2   100  160 exon transcript_1_exon_2          NA         NA
# 3   190  250 exon transcript_1_exon_3          NA        220
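
If you'd rather stay close to the summarise() attempt in the question, here is a rough sketch of that variant. It rebuilds df1 from the question's example so it runs on its own; the names codons and res are made up for illustration. It relies on the fact that indexing an empty vector with [1] gives NA, so no ifelse() is needed. As for the 'closure' error, my guess (gtf2 isn't shown, so this is only an assumption) is that start or stop was not a column of gtf2 under that name, so R fell back to the functions start() and stop(), which can't be subset.

```r
library(dplyr)

# Example data copied from the question; df1 is the name used in this answer.
df1 <- data.frame(
  start = c(1, 12, 100, 190, 217),
  stop  = c(90, 15, 160, 250, 220),
  type  = c("exon", "start_codon", "exon", "exon", "stop_codon"),
  name  = c("transcript_1_exon_1", "transcript_1_exon_1",
            "transcript_1_exon_2", "transcript_1_exon_3",
            "transcript_1_exon_3"),
  stringsAsFactors = FALSE
)

# One row per name: codon coordinates where a codon row exists, NA otherwise.
# x[cond][1] is NA when no row matches, because an empty vector indexed
# with [1] returns NA.
codons <- df1 %>%
  group_by(name) %>%
  summarise(
    start_codon = start[type == "start_codon"][1],
    stop_codon  = stop[type == "stop_codon"][1]
  )

# Keep only the exon rows and attach the codon columns by name.
res <- df1 %>%
  filter(type == "exon") %>%
  left_join(codons, by = "name")

print(res)
```

This assumes at most one start_codon and one stop_codon row per name; with more than one, [1] silently keeps the first.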

Upvotes: 1
