Nono_sad
Nono_sad

Reputation: 443

R group rows based on a value of specific column and keep lower and greater value on others columns

I have a dataframe

library(data.table)

test_df=data.frame(chr=c("chr1","chr1","chr1","chr2","chr2","chr1","chr1","chr2","chr1"),
                   start=c(700,800,900,350,400,650,800,200,1000),
                   end=c(750,850,950,400,450,700,850,250,1050),
                   id=c("id_1","id_1","id_1","id_2","id_2","id_1","id_1","id_2","id_3"))

I'm trying to groups the rows by ID column. For rows which have same values in chr and id, keep the lower value of start column and greater value for end column.

I expect for this dataframe something like :

chr     start     end     id
chr1     650      950     id_1
chr1     1000     1050    id_3
chr2     200      450     id_2

I tried this :

final_test_df=setDF(setDT(test_df)[, .(chr=chr[1],start=start[1], end=end[.N]),by=id])

and it gives me the first and last value for same id, not the lower/greater values.

id     chr     start     end
id_1   chr1     700      850     
id_2   chr2     350      250 
id_3   chr1     1000     1050    

Upvotes: 1

Views: 1962

Answers (3)

ddggdd
ddggdd

Reputation: 1

test_df=data.frame(chr=c("chr1","chr1","chr1","chr2","chr2","chr1","chr1","chr2"),
                   start=c(700,800,900,350,400,650,800,200),
                   end=c(750,850,950,400,450,700,850,250),
                   id=c("id_1","id_1","id_1","id_2","id_2","id_1","id_1","id_2"))

setDT(test_df)
test_df[,.(chr=chr[1],start=start[order(start)][1], end=end[order(end)][.N]), by=id]

using the original data,id and chr seems duplicated info

Upvotes: 0

zx8754
zx8754

Reputation: 56149

Group by both chr and id, then get min/max for star/end:

test_df[, .(start = min(start), end = max(end)), by = .(id, chr) ]
#     id  chr start  end
#1: id_1 chr1   650  950
#2: id_2 chr2   200  450
#3: id_3 chr1  1000 1050

Same using dplyr:

test_df %>%
  group_by(id, chr) %>% 
  summarise(start = min(start), end = max(end))
# # A tibble: 3 x 4
# # Groups:   id [2]
#   id    chr   start   end
#   <fct> <fct> <dbl> <dbl>
# 1 id_1  chr1    650   950
# 2 id_2  chr2    200   450
# 3 id_3  chr1   1000  1050

Upvotes: 2

Areza
Areza

Reputation: 6080

I recommend using a dplyr package,

test_df %>% group_by(chr) %>% 
mutate(mymin = min(start), mymax = max(end), myunique = unique(id)) %>% 
select(mymin, mymax, myunique) %>% distinct

Upvotes: 1

Related Questions