Reputation: 1078
I am working with an R script that does not accept duplicates in a certain column, because that column gets assigned to the rownames of the `data.frame`. Many of the rows of the data frame I am working with, however, contain duplicate genes (`gene_id`) and normalized expression values, because they were sequenced in exonic regions of the same gene along the transcriptome.
> data.table(df) %>%
+ group_by(gene_id)
# A tibble: 138,930 x 544
# Groups: gene_id [22,672]
`#chr` start end gene_id `XXXXX… `XXXXX…
<fct> <int> <int> <fct> <dbl> <dbl>
1 chr20 290428 290429 ENSG0000019647… 0.830 2.54
2 chr20 290748 290749 ENSG0000019647… 0.830 2.54
3 chr20 290777 290778 ENSG0000019647… 0.830 2.54
4 chr20 296967 296968 ENSG0000024731… -0.0533 0.308
5 chr20 325233 325234 ENSG0000022537… -0.299 -0.274
6 chr20 325594 325595 ENSG0000017773… 0.246 1.98
7 chr20 346781 346782 ENSG0000012584… -0.156 -1.06
8 chr20 346882 346883 ENSG0000012584… -0.156 -1.06
9 chr20 347023 347024 ENSG0000012584… -0.156 -1.06
10 chr20 347104 347105 ENSG0000012584… -0.156 -1.06
I'd like to consolidate, for example, rows 1-3 into a single row, keeping the lowest `start` value and the highest `end` value. However, I don't know how to do this in `dplyr` other than starting with `group_by` on the common column. What do you suggest, and how would I go from here?
Upvotes: 1
Views: 39
Reputation: 886938
We can use `slice_head` with `n = 1` after updating the 'start' and 'end' columns with `min` and `max` (or `first` and `last`, if ordered) on the grouped data:
library(dplyr)
df %>%
  group_by(gene_id) %>%                           # one group per gene
  mutate(start = min(start), end = max(end)) %>%  # gene's span across its rows
  slice_head(n = 1) %>%                           # keep one row per group
  ungroup()
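A quick sanity check on toy data (the short `gene_id` values are made-up placeholders for illustration):

```r
library(dplyr)

# Toy data: three exonic rows for one gene, one row for another
df <- data.frame(
  chr     = c("chr20", "chr20", "chr20", "chr20"),
  start   = c(290428, 290748, 290777, 296967),
  end     = c(290429, 290749, 290778, 296968),
  gene_id = c("ENSG_A", "ENSG_A", "ENSG_A", "ENSG_B")
)

out <- df %>%
  group_by(gene_id) %>%
  mutate(start = min(start), end = max(end)) %>%
  slice_head(n = 1) %>%
  ungroup()

out
# One row per gene_id: ENSG_A now spans 290428-290778; ENSG_B is unchanged
```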
Or use `distinct`:
df %>%
  group_by(gene_id) %>%
  mutate(start = min(start), end = max(end)) %>%
  ungroup() %>%
  distinct(gene_id, .keep_all = TRUE)  # first row per gene_id, all columns kept
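The same toy check for the `distinct` variant (again with made-up gene IDs): after the grouped `mutate`, the duplicate rows per gene carry identical `start`/`end`, so keeping the first occurrence per `gene_id` yields one consolidated row per gene:

```r
library(dplyr)

# Toy data with placeholder gene IDs
df <- data.frame(
  chr     = c("chr20", "chr20", "chr20", "chr20"),
  start   = c(290428, 290748, 290777, 296967),
  end     = c(290429, 290749, 290778, 296968),
  gene_id = c("ENSG_A", "ENSG_A", "ENSG_A", "ENSG_B")
)

out <- df %>%
  group_by(gene_id) %>%
  mutate(start = min(start), end = max(end)) %>%
  ungroup() %>%
  distinct(gene_id, .keep_all = TRUE)

out
# Two rows: one consolidated row for ENSG_A, one for ENSG_B
```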
Upvotes: 2