user3460660
user3460660

Reputation: 9

Extracting highest values from a column in a time series dataframe

I have a data frame containing monthly NDVI values from 2000-2012 for 26 stations. I have sorted my dataframe first according to year, then station and lastly ndvi.

My dataframe R looks something like this (sorry about the formatting):

t  station  year  month ndvi   altitude precipitation  
8   a   2000    aug  0.7793 2143    592.9  
9   a   2000    sept 0.7524 2143    135.3  
10  a   2000    oct  0.6597 2143    77.5  
4   a   2000    apr 0.6029  2143    72.6  
7   a   2000    jul 0.6018  2143    606.1  
11  a   2000    nov 0.5801  2143    4.4  
12  a   2000    dec 0.5228  2143    0  
6   a   2000    jun 0.4969  2143    505.9  
5   a   2000    may 0.4756  2143    241.7  
2   a   2000    feb 0.4396  2143    4  
3   a   2000    mar 0.4393  2143    25.5  
1   a   2000    jan 0.4138  2143    16  
8   b   2000    aug 0.7523  122 832.3  
9   b   2000    sept    0.7003  122 229.7  
7   b   2000    jul 0.667   122 662  
5   b   2000    may 0.6639  122 323.3  
4   b   2000    apr 0.593   122 88.6  
6   b   2000    jun 0.5508  122 752.1  

I need to extract the top three ndvi rows for each station for each year and tried using this code:

top3 <- split(R, R$station)
subsetted.data <- lapply(top3, FUN = function(x) head(x, 3))
subsetted.data
flatten.data <- do.call("rbind", subsetted.data)
View(flatten.data)

However, I then only get a data frame with the top three ndvi rows of stations in 2000 and not the years after.

Does anyone know how I can fix this?

Thank you.

Upvotes: 0

Views: 123

Answers (2)

Rich Scriven
Rich Scriven

Reputation: 99331

I plugged in some arbitrary "2001" years to show the separation. I prefer to order the data by the column of interest first, then split that. You can use do.call(rbind, ...) on the result if you choose. The result is the top three "ndvi" by station by year.

> dat$year[c(8:12, 16:18)] <- 2001  ## add some 2001 years
> ord <- dat[order(-dat$ndvi), ]
> lapply(split(ord, list(ord$station, ord$year)), head, 3)
$a.2000
   t station year month   ndvi altitude precipitation
1  8       a 2000   aug 0.7793     2143         592.9
2  9       a 2000  sept 0.7524     2143         135.3
3 10       a 2000   oct 0.6597     2143          77.5

$b.2000
   t station year month   ndvi altitude precipitation
13 8       b 2000   aug 0.7523      122         832.3
14 9       b 2000  sept 0.7003      122         229.7
15 7       b 2000   jul 0.6670      122         662.0

$a.2001
   t station year month   ndvi altitude precipitation
8  6       a 2001   jun 0.4969     2143         505.9
9  5       a 2001   may 0.4756     2143         241.7
10 2       a 2001   feb 0.4396     2143           4.0

$b.2001
   t station year month   ndvi altitude precipitation
16 5       b 2001   may 0.6639      122         323.3
17 4       b 2001   apr 0.5930      122          88.6
18 6       b 2001   jun 0.5508      122         752.1

Upvotes: 0

BrodieG
BrodieG

Reputation: 52637

You need to split by the interaction of station AND year:

R <- R[order(R$ndvi, decreasing=T), ]
top3 <- split(R, interaction(R$station, R$year))   # <<<<<<<<<< this is the change
subsetted.data <- lapply(top3, FUN = function(x) head(x, 3))
subsetted.data
flatten.data <- do.call("rbind", subsetted.data)

This works (see my data at the end). That said, this type of thing is much easier handled with packages like data.table:

library(data.table)
data.table(R)[order(ndvi, decreasing=T), head(.SD, 3), by=list(station, year)]

Note you can order data.tables faster by using keys, but I'm omitting that for the sake of clarity here.


Data:

set.seed(1)
R <- expand.grid(year=2000:2010, station=letters[1:5], month=month.abb)
R$ndvi <- runif(nrow(R))

Upvotes: 1

Related Questions