carlosmaria
carlosmaria

Reputation: 337

Calculation in columns using previous row value without loops

I have data in columns which I need to do calculations on. Is it possible to do this using previous row values without using a loop? E.g. if in the first column the value is 139, calculate the median of the last 5 values and the percent change of the value 5 rows above and the value in the current row?

ID  Data    PF
135 5       123
136 4       141
137 5       124
138 6       200
139 1       310
140 2       141
141 4       141

So here in this dataset you would do:

  1. Find 139 in ID column
  2. Return average of last 5 rows in Data (Gives 4.2)
  3. Return performance of values in PF 5 rows above to current value (Gives 152%)

If I would do a loop it looks like this:

for (i in 1:nrow(data)){
  if(data$ID == "139" & i>=3) 
    {data$New_column <- data[i,"PF"] / data[i-4,"PF"] - 1
}

The problem is that the loop takes too long due to to many data points. The ID 139 will appear several times in the dataset.

Many thanks. Carlos

Upvotes: 2

Views: 1427

Answers (3)

aichao
aichao

Reputation: 7435

As pointed out by Tutuchacn and Sotos, use the package zoo to get the mean of the Data in the last N rows (inclusive of the row) you are querying (assuming your data is in the data frame df):

library(zoo)

ind <- which(df$ID==139)                             ## this is the row you are querying
N <- 5                                               ## here, N is 5
res <- rollapply(df$Data, width=N, mean)[ind-(N-1)]
print(res)
## [1] 4.2

rollapply(..., mean) returns the rolling mean of the windowed data of width=N. Note that the index used to query the output from rollapply is lagged by N-1 because the rolling mean is applied forward in the series.

To get the percent performance from PF as you specified:

percent.performance <- function(x) {
  z <- zoo(x)                                        ## create a zoo series
  lz <- lag(z,4)                                     ## create the lag version
  return(z/lz - 1)
}
res <- as.numeric(percent.performance(df$PF)[ind])
print(res)
## [1] 1.520325

Here, we define a function percent.performance that returns what you want for all rows of df for which the computation makes sense. We then extract the row we want using ind and convert it to a number.

Hope this helps.

Upvotes: 2

Abdou
Abdou

Reputation: 13274

This could be a decent start:

mytext = "ID,Data,PF
135,5,123
136,4,141
137,5,124
138,6,200
139,1,310
140,2,141
141,4,141"

mydf <- read.table(text=mytext, header = T, sep = ",")

do.call(rbind,lapply(mydf$ID[which(mydf$ID==139):nrow(mydf)], function(x) {
    tempdf <- mydf[1:which(mydf$ID==x),]
    data.frame(ID=x,Data=mean(tempdf$Data),PF=100*(tempdf[nrow(tempdf),"PF"]-tempdf[(nrow(tempdf)-4),"PF"])/tempdf[(nrow(tempdf)-4),"PF"])
}))

ID     Data        PF
139 4.200000 152.03252
140 3.833333   0.00000
141 3.857143  13.70968

The idea here is: You take ID's starting from 139 to the end and use the lapply function on each of them by generating a temporary data.frame which includes all the rows above that particular ID (including the ID itself). Then you grab the mean of the Data column and the rate of change (i.e. what you call performance) of the PF column.

Upvotes: 0

Robert
Robert

Reputation: 5152

Is that what you want?

ntest=139
sol<-sapply(5:nrow(df),function(ii){#ii=6
  tdf<-df[(ii-4):ii,]
  if(tdf[5,1]==ntest)
    c(row=ii,aberage=mean(tdf[,"Data"]),performance=round(100*tdf[5,"PF"]/tdf[1,"PF"]-1,0))
})
sol<- sol[ ! sapply(sol, is.null) ] #remove NULLs
sol

[[1]]
        row     aberage performance 
        5.0         4.2       251.0 

Upvotes: 0

Related Questions