Reputation: 321
I am working with biological sequence data, in GTF format. Here is a simple example of the format:
start  stop  type         name
1      90    exon         transcript_1_exon_1
12     15    start_codon  transcript_1_exon_1
100    160   exon         transcript_1_exon_2
190    250   exon         transcript_1_exon_3
217    220   stop_codon   transcript_1_exon_3
I am trying to convert exons to their protein sequences. However, some parts of the exons are not protein-coding. This is indicated by the presence of a row with the type field set to start_codon or stop_codon.
I would like to move the start and stop of these features, respectively, into their own columns when present, as follows:
start  stop  type  name                 start_codon  stop_codon
1      90    exon  transcript_1_exon_1  12           NA
100    160   exon  transcript_1_exon_2  NA           NA
190    250   exon  transcript_1_exon_3  NA           220
However, I can't figure out how to do this in R. The closest I've come using dplyr is:
gtf3 <- gtf2 %>%
  group_by(feature_name) %>%
  summarise(
    start_codon = ifelse(sum(type == "start_codon") != 0, start[type == "start_codon"], NA),
    stop_codon = ifelse(sum(type == "stop_codon") != 0, stop[type == "stop_codon"], NA))
but this gives me the following error: Evaluation error: object of type 'closure' is not subsettable.
How can I move the start and end of the start/stop codons, respectively, into their own columns when they are present?
Upvotes: 0
Views: 102
Reputation: 47320
Here's a way to do it:
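library(dplyr)
# df1 is the example table from the question. Keep the exon rows, then join codon coordinates by name.
df1 %>%
  filter(type == "exon") %>%
  # start of any start_codon row; the clashing `start` column gets the "_codon" suffix
  left_join(df1 %>%
              filter(type == "start_codon") %>%
              select(-type, -stop),
            by = "name", suffix = c("", "_codon")) %>%
  # stop of any stop_codon row; the clashing `stop` column gets the "_codon" suffix
  left_join(df1 %>%
              filter(type == "stop_codon") %>%
              select(-type, -start),
            by = "name", suffix = c("", "_codon"))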
#   start stop type                name start_codon stop_codon
# 1     1   90 exon transcript_1_exon_1          12         NA
# 2   100  160 exon transcript_1_exon_2          NA         NA
# 3   190  250 exon transcript_1_exon_3          NA        220
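If you want to run this end to end, here is a minimal self-contained sketch. It assumes the example table from the question is typed in by hand as df1 (how your real GTF file is read in is not covered here), and the helper names start_codons, stop_codons and result are mine, for illustration only. It is a slight variant of the join above: instead of relying on suffix, it renames the codon coordinate inside select() before joining, which should give the same three-row table.

```r
library(dplyr)

# Example data from the question, entered by hand (an assumption about how it is loaded)
df1 <- data.frame(
  start = c(1, 12, 100, 190, 217),
  stop  = c(90, 15, 160, 250, 220),
  type  = c("exon", "start_codon", "exon", "exon", "stop_codon"),
  name  = c("transcript_1_exon_1", "transcript_1_exon_1",
            "transcript_1_exon_2", "transcript_1_exon_3",
            "transcript_1_exon_3"),
  stringsAsFactors = FALSE
)

# One lookup table per codon type, keeping only the name and the coordinate of interest
start_codons <- df1 %>%
  filter(type == "start_codon") %>%
  select(name, start_codon = start)

stop_codons <- df1 %>%
  filter(type == "stop_codon") %>%
  select(name, stop_codon = stop)

# Exons without a matching codon row get NA from the left joins
result <- df1 %>%
  filter(type == "exon") %>%
  left_join(start_codons, by = "name") %>%
  left_join(stop_codons, by = "name")

print(result)
```

Renaming before the join keeps the new column names explicit, at the cost of two extra intermediate objects. Note that both versions assume at most one start_codon and one stop_codon row per name; with more than one, the left joins would duplicate the exon row.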
Upvotes: 1