Extracting highest values from a column in a time series dataframe

Question

I have a data frame containing monthly NDVI values from 2000-2012 for 26 stations. I have sorted my dataframe first according to year, then station and lastly ndvi.

My dataframe R looks something like this (sorry about the formatting):

t   station  year  month  ndvi    altitude  precipitation
8   a        2000  aug    0.7793  2143      592.9
9   a        2000  sept   0.7524  2143      135.3
10  a        2000  oct    0.6597  2143      77.5
4   a        2000  apr    0.6029  2143      72.6
7   a        2000  jul    0.6018  2143      606.1
11  a        2000  nov    0.5801  2143      4.4
12  a        2000  dec    0.5228  2143      0
6   a        2000  jun    0.4969  2143      505.9
5   a        2000  may    0.4756  2143      241.7
2   a        2000  feb    0.4396  2143      4
3   a        2000  mar    0.4393  2143      25.5
1   a        2000  jan    0.4138  2143      16
8   b        2000  aug    0.7523  122       832.3
9   b        2000  sept   0.7003  122       229.7
7   b        2000  jul    0.667   122       662
5   b        2000  may    0.6639  122       323.3
4   b        2000  apr    0.593   122       88.6
6   b        2000  jun    0.5508  122       752.1

I need to extract the top three ndvi rows for each station for each year and tried using this code:

top3 <- split(R, R$station)
subsetted.data <- lapply(top3, FUN = function(x) head(x, 3))
subsetted.data
flatten.data <- do.call("rbind", subsetted.data)
View(flatten.data)

However, I then only get a data frame with the top three ndvi rows of stations in 2000 and not the years after.

Does anyone know how I can fix this?

Thank you.

BrodieG · Accepted Answer

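You need to split by the interaction of station AND year. Splitting by station alone gives one group per station covering every year, and because your rows are sorted by year first, head(x, 3) only returns rows from 2000: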

# Sort by ndvi, highest first, so head() picks the largest values in each group
R <- R[order(R$ndvi, decreasing = TRUE), ]
top3 <- split(R, interaction(R$station, R$year))   # <<<<<<<<<< this is the change
subsetted.data <- lapply(top3, FUN = function(x) head(x, 3))
subsetted.data
flatten.data <- do.call("rbind", subsetted.data)

This works (see my data at the end). That said, this kind of task is much easier to handle with a package such as data.table:

library(data.table)
data.table(R)[order(ndvi, decreasing = TRUE), head(.SD, 3), by = list(station, year)]

Note you can order data.tables faster by using keys, but I'm omitting that for the sake of clarity here.
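For reference, here is a rough sketch of what ordering up front could look like. It uses setorder() rather than setkey(), on the assumption that a descending sort on ndvi is wanted (keys sort ascending). This is a guess at the kind of approach meant above, not the code the note refers to. The names DT and top3_dt are made up for illustration, and the sample data is rebuilt so the block runs on its own.

```r
library(data.table)

# Sample data (same construction as in the Data section of this answer)
set.seed(1)
R <- expand.grid(year = 2000:2010, station = letters[1:5], month = month.abb)
R$ndvi <- runif(nrow(R))

# DT and top3_dt are illustrative names, not from the original answer
DT <- as.data.table(R)

# Reorder by reference: station, then year, then ndvi from highest to lowest
setorder(DT, station, year, -ndvi)

# Rows are already sorted within each group, so head() keeps the top three
top3_dt <- DT[, head(.SD, 3), by = .(station, year)]
top3_dt
```

Because setorder() modifies DT in place, no copy of the table is made during the sort.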

Data:

set.seed(1)
R <- expand.grid(year=2000:2010, station=letters[1:5], month=month.abb)
R$ndvi <- runif(nrow(R))
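
To check the base R approach end to end, something like the following should work. It rebuilds the sample data, applies the split on station and year, and then confirms that every station-year pair contributes exactly three rows. The names groups and counts are only for illustration.

```r
# Sample data: 11 years x 5 stations x 12 months
set.seed(1)
R <- expand.grid(year = 2000:2010, station = letters[1:5], month = month.abb)
R$ndvi <- runif(nrow(R))

# Sort by ndvi (highest first), split by station-year, keep the first three rows per group
R <- R[order(R$ndvi, decreasing = TRUE), ]
groups <- split(R, interaction(R$station, R$year))
flatten.data <- do.call("rbind", lapply(groups, head, 3))

# Cross-tabulate the result: each station-year cell should hold exactly 3 rows
counts <- table(flatten.data$station, flatten.data$year)
stopifnot(all(counts == 3))
counts
```

With real data where some station-year combinations are missing, interaction() still creates empty groups for them. They contribute no rows to the final rbind, but the equality check above would need to allow zero counts.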
