CelineDion
CelineDion

Reputation: 1078

How to consolidate rows based on whether they share the same value in a key column?

I am working with an R script that does not accept duplicates in a certain column because that column gets assigned to the rownames of the data.frame. Many of the rows of the data frame that I am working with, however, contain duplicate genes (gene_id) and normalized expression values because they were sequenced in an exonic region of the same gene along the transcriptome.

> data.table(df) %>% 
+         group_by(gene_id)
# A tibble: 138,930 x 544
# Groups:   gene_id [22,672]
   `#chr`  start    end gene_id         `XXXXX… `XXXXX…
   <fct>   <int>  <int> <fct>                          <dbl>               <dbl>
 1 chr20  290428 290429 ENSG0000019647…               0.830                2.54 
 2 chr20  290748 290749 ENSG0000019647…               0.830                2.54 
 3 chr20  290777 290778 ENSG0000019647…               0.830                2.54 
 4 chr20  296967 296968 ENSG0000024731…              -0.0533               0.308
 5 chr20  325233 325234 ENSG0000022537…              -0.299               -0.274
 6 chr20  325594 325595 ENSG0000017773…               0.246                1.98 
 7 chr20  346781 346782 ENSG0000012584…              -0.156               -1.06 
 8 chr20  346882 346883 ENSG0000012584…              -0.156               -1.06 
 9 chr20  347023 347024 ENSG0000012584…              -0.156               -1.06 
10 chr20  347104 347105 ENSG0000012584…              -0.156               -1.06 

I'd like to consolidate, for example, rows 1-3 and keep the lowest start value and the highest end value, and only get 1 resulting row. However, I don't know how to do this in dplyr other than to start by group_by the common column. What do you suggest/how would I go from here?

Upvotes: 1

Views: 39

Answers (1)

akrun
akrun

Reputation: 886938

We can use slice_head with n = 1 after updating the 'start', 'end' columns by the min and max (or first and last - if ordered) on the grouped data

library(dplyr)
df %>%
    group_by(gene_id) %>%
    mutate(start = min(start), end = max(end)) %>%    
    slice_head(n = 1) %>%
    ungroup 

Or use distinct

df %>%
   group_by(gene_id) %>%
   mutate(start = min(start), end = max(end)) %>%
   ungroup %>%
   distinct(gene_id, .keep_all = TRUE)

Upvotes: 2

Related Questions