Reputation: 109
I am looking for solutions using data.table ― I have a data.table with the following columns:
data <- data.frame(GROUP=c(3,3,4,4,5,6),
YEAR=c(1979,1985,1999,2011,2012,1994),
NAME=c("S","A","J","L","G","A"))
data <- as.data.table(data)
Data.table:
GROUP YEAR NAME
3 1979 Smith
3 1985 Anderson
4 1999 James
4 2011 Liam
5 2012 George
6 1994 Adams
For each group we want to select one row using the following rule:
Desired output:
GROUP YEAR NAME
3 1985 Anderson
4 2011 Liam
5 2012 George
6 1994 Adams
Thanks! I have been struggling with this for a while.
Upvotes: 3
Views: 2628
Reputation: 66819
You could also do a couple rolling joins:
res = unique(data[, .(GROUP)])
# get row with YEAR above 2000
res[, w := data[c(.SD, YEAR = 2000), on=.(GROUP, YEAR), roll=-Inf, which=TRUE]]
# if none found, get row with nearest YEAR below
res[is.na(w), w := data[c(.SD, YEAR = 2000), on=.(GROUP, YEAR), roll=Inf, which=TRUE]]
# subset by row numbers
data[res$w]
GROUP YEAR NAME
1: 3 1985 A
2: 4 2011 L
3: 5 2012 G
4: 6 1994 A
Upvotes: 3
Reputation: 93813
data.table
should be a lot simpler if you subset the special .I
row counter:
library(data.table)
setDT(data)
data[
data[
,
if(any(YEAR > 2000))
.I[which.min(2000 - YEAR)] else
.I[which.max(YEAR)],
by=GROUP
]$V1
]
# GROUP YEAR NAME
#1: 3 1985 A
#2: 4 2011 L
#3: 5 2012 G
#4: 6 1994 A
Thanks to @r2evans for the background info -
.I
is an integer vector equivalent toseq_len(nrow(x))
.
Ref: http://rdrr.io/cran/data.table/man/special-symbols.html
So, all I'm doing here is getting the matching row index for the whole of data
for each of the calculations at each by=
level. Then using these row indexes to subset data
again.
Upvotes: 8
Reputation: 2253
Using the dplyr
package I got your output like this (though it may not be the simplest answer):
library(dplyr)
library(magrittr)
data <- data.frame(GROUP=c(3,3,4,4,5,6),
YEAR=c(1979,1985,1999,2011,2012,1994),
NAME=c("S","A","J","L","G","A"))
data %>%
subset(YEAR < 2000) %>%
group_by(GROUP) %>%
summarise(MAX=max(YEAR)) %>%
join(data %>%
subset(YEAR > 2000) %>%
group_by(GROUP) %>%
summarise(MIN=min(YEAR)), type="full") %>%
mutate(YEAR=ifelse(is.na(MIN), MAX, MIN)) %>%
select(c(GROUP, YEAR)) %>%
join(data)
Results:
GROUP YEAR NAME
3 1985 A
4 2011 L
5 2012 G
6 1994 A
EDIT: Sorry, my first answer didn't take into account the min/max conditions. Hope this helps
Upvotes: 2