Reputation: 41
Here is my toy dataframe, the real one might have 40K-1M records and five additional columns
animal1 version1 animal2 version2 sim
53 20154620 TRUSEQ.v1 20104647 F250v1 0.3663569
854 20145687 TRUSEQ.v1 20105551 F250v1 0.5732854
3662 20154620 TRUSEQ.v1 20114509 F250v1 0.3374918
4063 20154620 TRUSEQ.v1 20114578 F250v1 0.3732692
4464 20154620 TRUSEQ.v1 20114595 F250v1 0.3772367
5262 20144516 TRUSEQ.v1 20115051 770k.v1 0.6034206
5663 20144516 TRUSEQ.v1 20115051 F250v1 0.6164795
5664 20145008 TRUSEQ.v1 20115051 F250v1 0.3146651
6064 20144516 TRUSEQ.v1 20115059 F250v1 0.3043295
6471 20165119 F250v1 20115096 F250v1 0.388435
9677 20154620 TRUSEQ.v1 20118095 F250v1 0.3079702
11281 20154620 TRUSEQ.v1 20134529 F250v1 0.3188631
12486 20165119 F250v1 20135032 F250v1 0.6091486
13282 20144516 TRUSEQ.v1 20135047 F250v1 0.3098507
14090 20165119 F250v1 20135072 F250v1 0.3025007
14892 20165119 F250v1 20135122 F250v1 0.345238
For each animal1, I need all rows featuring the top 3 unique animal2 values by highest sim... so my desired result is reproduced below.
animal1 version1 animal2 version2 sim
5663 20144516 TRUSEQ.v1 20115051 F250v1 0.6164795
5262 20144516 TRUSEQ.v1 20115051 770k.v1 0.6034206
13282 20144516 TRUSEQ.v1 20135047 F250v1 0.3098507
6064 20144516 TRUSEQ.v1 20115059 F250v1 0.3043295
5664 20145008 TRUSEQ.v1 20115051 F250v1 0.3146651
854 20145687 TRUSEQ.v1 20105551 F250v1 0.5732854
4464 20154620 TRUSEQ.v1 20114595 F250v1 0.3772367
4063 20154620 TRUSEQ.v1 20114578 F250v1 0.3732692
53 20154620 TRUSEQ.v1 20104647 F250v1 0.3663569
12486 20165119 F250v1 20135032 F250v1 0.6091486
6471 20165119 F250v1 20115096 F250v1 0.388435
14892 20165119 F250v1 20135122 F250v1 0.345238
So in the subset, each animal1 might have between 1 and 20 observations but will have <=n unique values of animal2 where n=3 in this case.
I can sort the df by sim and animal1 like this
mydf <- mydf[order(-xtfrm(mydf[,"animal1"]), -mydf[,"sim"]),]
I can grab the first n observations per animal1 like this
mydf2 <- by(mydf, mydf["animal1"], head, n=1)
mydf2 <- Reduce(rbind, mydf2)
But how do I apply n to a third column, animal2 rather than the number of observations? Apologies if this is a duplicate, the answer is probably hidden in here, how to find the top N values by group or within category (groupwise) in an R data.frame but I just can't seem to stitch together a solution to my problem from the answers.
Upvotes: 0
Views: 117
Reputation: 581
The code below will include multiple animal1
-animal2
combinations, only if the sim
value for the lower animal1
-animal2
entry "would be" in the top 3. Let me know if I've misinterpreted.
library(dplyr)
selected <- dat %>%
arrange(animal1,animal2,desc(sim)) %>%
group_by(animal1,animal2) %>%
mutate(rank=row_number()) %>%
filter(rank==1) %>% ungroup() %>%
group_by(animal1) %>%
top_n(3,sim) %>%
summarise(threshold = min(sim))
out <- dat %>%
inner_join(selected, by = c("animal1"="animal1")) %>%
filter(sim>=threshold) %>%
arrange(animal1,animal2,desc(sim)) %>%
select(-threshold)
> out
animal1 version1 animal2 version2 sim
1 20144516 TRUSEQ.v1 20115051 F250v1 0.6164795
2 20144516 TRUSEQ.v1 20115051 770k.v1 0.6034206
3 20144516 TRUSEQ.v1 20115059 F250v1 0.3043295
4 20144516 TRUSEQ.v1 20135047 F250v1 0.3098507
5 20145008 TRUSEQ.v1 20115051 F250v1 0.3146651
6 20145687 TRUSEQ.v1 20105551 F250v1 0.5732854
7 20154620 TRUSEQ.v1 20104647 F250v1 0.3663569
8 20154620 TRUSEQ.v1 20114578 F250v1 0.3732692
9 20154620 TRUSEQ.v1 20114595 F250v1 0.3772367
10 20165119 F250v1 20115096 F250v1 0.3884350
11 20165119 F250v1 20135032 F250v1 0.6091486
12 20165119 F250v1 20135122 F250v1 0.3452380
Upvotes: 0
Reputation: 160687
Notwithstanding my question about animal "20144516", here are a couple solutions, using dat
as your sample data above (included at the bottom for reproducibility). I'm offering base-R and dplyr
, though as suggested in a comment by @Balter, there is likely a straight-forward data.table
methodology as well.
# ordering by animal1 is not necessary, sim is priority
dat <- dat[rev(order(dat$sim)),]
dat2 <- do.call(rbind, by(dat, list(dat$animal1, dat$animal2), head, n = 1))
# ... but we need to re-sort by sim, since the ordering is lost with `by`
dat2 <- dat2[rev(order(dat2$sim)),]
head(dat2)
# animal1 version1 animal2 version2 sim
# 5663 20144516 TRUSEQ.v1 20115051 F250v1 0.6164795
# 12486 20165119 F250v1 20135032 F250v1 0.6091486
# 854 20145687 TRUSEQ.v1 20105551 F250v1 0.5732854
# 6471 20165119 F250v1 20115096 F250v1 0.3884350
# 4464 20154620 TRUSEQ.v1 20114595 F250v1 0.3772367
# 4063 20154620 TRUSEQ.v1 20114578 F250v1 0.3732692
This gives us the top 1 pairing of animal1
and animal2
, sorted (descending) by sim
. Now we effectively repeat the process with animal1
only:
dat3 <- do.call(rbind, by(dat, list(dat$animal1), head, n = 3))
dat3
# animal1 version1 animal2 version2 sim
# 20144516.5663 20144516 TRUSEQ.v1 20115051 F250v1 0.6164795
# 20144516.5262 20144516 TRUSEQ.v1 20115051 770k.v1 0.6034206
# 20144516.13282 20144516 TRUSEQ.v1 20135047 F250v1 0.3098507
# 20145008 20145008 TRUSEQ.v1 20115051 F250v1 0.3146651
# 20145687 20145687 TRUSEQ.v1 20105551 F250v1 0.5732854
# 20154620.4464 20154620 TRUSEQ.v1 20114595 F250v1 0.3772367
# 20154620.4063 20154620 TRUSEQ.v1 20114578 F250v1 0.3732692
# 20154620.53 20154620 TRUSEQ.v1 20104647 F250v1 0.3663569
# 20165119.12486 20165119 F250v1 20135032 F250v1 0.6091486
# 20165119.6471 20165119 F250v1 20115096 F250v1 0.3884350
# 20165119.14892 20165119 F250v1 20135122 F250v1 0.3452380
(The rownames are munged, unfortunately. If they are meaningful, I suggest you place rownames(dat)
into a column and preserve it there.)
dplyr
You can also use dplyr
.
library(dplyr)
dat %>%
group_by(animal1, animal2) %>%
top_n(1, wt = sim) %>%
group_by(animal1) %>%
top_n(3, wt = sim) %>%
ungroup()
# # A tibble: 11 × 5
# animal1 version1 animal2 version2 sim
# <int> <fctr> <int> <fctr> <dbl>
# 1 20144516 TRUSEQ.v1 20115051 F250v1 0.6164795
# 2 20165119 F250v1 20135032 F250v1 0.6091486
# 3 20145687 TRUSEQ.v1 20105551 F250v1 0.5732854
# 4 20165119 F250v1 20115096 F250v1 0.3884350
# 5 20154620 TRUSEQ.v1 20114595 F250v1 0.3772367
# 6 20154620 TRUSEQ.v1 20114578 F250v1 0.3732692
# 7 20154620 TRUSEQ.v1 20104647 F250v1 0.3663569
# 8 20165119 F250v1 20135122 F250v1 0.3452380
# 9 20145008 TRUSEQ.v1 20115051 F250v1 0.3146651
# 10 20144516 TRUSEQ.v1 20135047 F250v1 0.3098507
# 11 20144516 TRUSEQ.v1 20115059 F250v1 0.3043295
PS: it can be significantly faster to use do.call(rbind, ...)
in place of Reduce(rbind, ...)
:
library(microbenchmark)
x <- by(dat, list(dat$animal1, dat$animal2), head, n = 1)
microbenchmark(
docall = do.call(rbind, x),
reduce = Reduce(rbind, x)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# docall 1.418577 1.493335 1.809469 1.551136 1.731466 5.216277 100
# reduce 11.119961 11.829614 13.302388 12.727255 13.401535 26.897520 100
This difference increases with a higher number of distinct animals. (This is because the Reduce
method calls rbind
once for each unique animal, whereas do.call
is calling rbind
only once.)
The sample data used here:
dat <- structure(list(animal1 = c(20154620L, 20145687L, 20154620L, 20154620L,
20154620L, 20144516L, 20144516L, 20145008L, 20144516L, 20165119L,
20154620L, 20154620L, 20165119L, 20144516L, 20165119L, 20165119L
), version1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 2L, 2L, 1L, 2L, 1L, 1L), .Label = c("F250v1", "TRUSEQ.v1"
), class = "factor"), animal2 = c(20104647L, 20105551L, 20114509L,
20114578L, 20114595L, 20115051L, 20115051L, 20115051L, 20115059L,
20115096L, 20118095L, 20134529L, 20135032L, 20135047L, 20135072L,
20135122L), version2 = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("770k.v1", "F250v1"
), class = "factor"), sim = c(0.3663569, 0.5732854, 0.3374918,
0.3732692, 0.3772367, 0.6034206, 0.6164795, 0.3146651, 0.3043295,
0.388435, 0.3079702, 0.3188631, 0.6091486, 0.3098507, 0.3025007,
0.345238)), .Names = c("animal1", "version1", "animal2", "version2",
"sim"), class = "data.frame", row.names = c("53", "854", "3662",
"4063", "4464", "5262", "5663", "5664", "6064", "6471", "9677",
"11281", "12486", "13282", "14090", "14892"))
Upvotes: 1