Reputation: 443
I have a dataframe
library(data.table)
test_df=data.frame(chr=c("chr1","chr1","chr1","chr2","chr2","chr1","chr1","chr2","chr1"),
start=c(700,800,900,350,400,650,800,200,1000),
end=c(750,850,950,400,450,700,850,250,1050),
id=c("id_1","id_1","id_1","id_2","id_2","id_1","id_1","id_2","id_3"))
I'm trying to groups the rows by ID column. For rows which have same values in chr and id, keep the lower value of start column and greater value for end column.
I expect for this dataframe something like :
chr start end id
chr1 650 950 id_1
chr1 1000 1050 id_3
chr2 200 450 id_2
I tried this :
final_test_df=setDF(setDT(test_df)[, .(chr=chr[1],start=start[1], end=end[.N]),by=id])
and it gives me the first and last value for same id, not the lower/greater values.
id chr start end
id_1 chr1 700 850
id_2 chr2 350 250
id_3 chr1 1000 1050
Upvotes: 1
Views: 1962
Reputation: 1
test_df=data.frame(chr=c("chr1","chr1","chr1","chr2","chr2","chr1","chr1","chr2"),
start=c(700,800,900,350,400,650,800,200),
end=c(750,850,950,400,450,700,850,250),
id=c("id_1","id_1","id_1","id_2","id_2","id_1","id_1","id_2"))
setDT(test_df)
test_df[,.(chr=chr[1],start=start[order(start)][1], end=end[order(end)][.N]), by=id]
using the original data,id and chr seems duplicated info
Upvotes: 0
Reputation: 56149
Group by both chr and id, then get min/max for star/end:
test_df[, .(start = min(start), end = max(end)), by = .(id, chr) ]
# id chr start end
#1: id_1 chr1 650 950
#2: id_2 chr2 200 450
#3: id_3 chr1 1000 1050
Same using dplyr:
test_df %>%
group_by(id, chr) %>%
summarise(start = min(start), end = max(end))
# # A tibble: 3 x 4
# # Groups: id [2]
# id chr start end
# <fct> <fct> <dbl> <dbl>
# 1 id_1 chr1 650 950
# 2 id_2 chr2 200 450
# 3 id_3 chr1 1000 1050
Upvotes: 2
Reputation: 6080
I recommend using a dplyr package,
test_df %>% group_by(chr) %>%
mutate(mymin = min(start), mymax = max(end), myunique = unique(id)) %>%
select(mymin, mymax, myunique) %>% distinct
Upvotes: 1