SMG
SMG

Reputation: 11

Improve speed of r 'for loop'

I'm fairly new to r and would like some help in speeding up the following 'for loop' code.

The code aims to create a percentile rank for each row, based on the ROE column, by Sector and Month

I'm replicating an excel spreadsheet in r, the spreadsheet uses the PERCENTRANK.INC function and the r code needs to exactly replicate it. I've investigated options in r to match this function, a dplyr method etc. but none seem to exactly replicate the results except for the loop below.

The crux of the problem is that the loop is taking 30 minutes to create all the percentile ranks for the dataframe (total rows in the input dataframe is around 90,000). Has anyone got any tips to speed up the below loop? I've read a number of the similar questions/answers on this site and tried a number of things, such as making adjustments to the subset statements near the top of the loop, however have only succeeded in limited improvements.

Details on the input file 'ROE_Quintiles'

Thanks a lot for your help.

SMG

My r code is shown below:

# Create dataframe to append to at the end of each iteration
ROE_Quintiles3 <- data.frame("Merge_Var3" = c('Temp'), "ROE2_percrank" = c(0.5))
End <- nrow(ROE_Quintiles)
system.time({
  for(i in 1:End) {
    Row <- ROE_Quintiles[i,]
    Row_Value <- subset(Row, select=c(ROE2))
    Row_Value2 <- mean(Row_Value$ROE2) # PercentRankArgument Value

    Row_Sector_Month <- subset(Row, select=c(Merge_Var4))
    Row_Sector_Month_Values <- subset(ROE_Quintiles, Merge_Var4==Row_Sector_Month$Merge_Var4, select=c(ROE2))

    # Filter Number to values less than the row value
    NumberLessThanArgument = subset(Row_Sector_Month_Values, ROE2 < Row_Value2)

    # Filter Number to values greater than or equal to the row value
    NumberGreaterThanOrEqualArgument = subset(Row_Sector_Month_Values, ROE2 >= Row_Value2)

    # RankLower = the count of Numbers less than row value, and is used later for 
    # interpolation of ranks
    RankLower <- nrow(NumberLessThanArgument)

    # NumberLower = the largest Number < row value, used for interpolation
    NumberLower <- ifelse(RankLower==0, Row_Value2, max(NumberLessThanArgument))

    # NumberUpper = the smallest Number >= row value, used for interpolation
    NumberUpper = min(NumberGreaterThanOrEqualArgument) 

    # PercentRankArgumentRank =  the rank of row value over the Number table, which is 
    # just RankLower + 1. This is the same rank as NumberUpper in the Number table itself.
    PercentRankArgumentRank = RankLower + 1

    # InterpolationFraction = fraction that row value is from NumberLower to NumberUpper  
    InterpolationFraction <- ifelse(RankLower==0, 0, (Row_Value2 - NumberLower)/(NumberUpper - NumberLower))

    # Calculate the interpolated rank
    RankInterpolated = max(1, RankLower + InterpolationFraction * (PercentRankArgumentRank - RankLower))

    # Get the count of Numbers
    NumberCount = nrow(Row_Sector_Month_Values)

    # Final PercentRank is (RankInterpolated - 1)/(NumberCount - 1)
    PercentRankOutput = (RankInterpolated - 1)/(NumberCount - 1)

    # Append to create main dataframe
    Row_Output <- subset(Row, select=c(Merge_Var3))
    Row_Output$ROE2_percrank <- PercentRankOutput
    ROE_Quintiles3 <- rbind(ROE_Quintiles3, Row_Output)
  }
})
ROE_Quintiles3 <- subset(ROE_Quintiles3, Merge_Var3 != 'Temp')

Upvotes: 0

Views: 157

Answers (1)

F. Priv&#233;
F. Priv&#233;

Reputation: 11738

As most of the questions "my R loop is slow", the problem is often about growing objects inside the loop. When I see ROE_Quintiles3 <- rbind(ROE_Quintiles3, Row_Output) in your loop, I guess this is the problem here.

See https://privefl.github.io/blog/why-loops-are-slow-in-r/ to understand the problem I'm pointing you to and several solutions (tip: I would go for something similar to gen_list()).

Upvotes: 2

Related Questions