Reputation: 11
I'm fairly new to r and would like some help in speeding up the following 'for loop' code.
The code aims to create a percentile rank for each row, based on the ROE column, by Sector and Month
I'm replicating an excel spreadsheet in r, the spreadsheet uses the PERCENTRANK.INC function and the r code needs to exactly replicate it. I've investigated options in r to match this function, a dplyr method etc. but none seem to exactly replicate the results except for the loop below.
The crux of the problem is that the loop is taking 30 minutes to create all the percentile ranks for the dataframe (total rows in the input dataframe is around 90,000). Has anyone got any tips to speed up the below loop? I've read a number of the similar questions/answers on this site and tried a number of things, such as making adjustments to the subset statements near the top of the loop, however have only succeeded in limited improvements.
Details on the input file 'ROE_Quintiles'
Thanks a lot for your help.
SMG
My r code is shown below:
# Create dataframe to append to at the end of each iteration
ROE_Quintiles3 <- data.frame("Merge_Var3" = c('Temp'), "ROE2_percrank" = c(0.5))
End <- nrow(ROE_Quintiles)
system.time({
for(i in 1:End) {
Row <- ROE_Quintiles[i,]
Row_Value <- subset(Row, select=c(ROE2))
Row_Value2 <- mean(Row_Value$ROE2) # PercentRankArgument Value
Row_Sector_Month <- subset(Row, select=c(Merge_Var4))
Row_Sector_Month_Values <- subset(ROE_Quintiles, Merge_Var4==Row_Sector_Month$Merge_Var4, select=c(ROE2))
# Filter Number to values less than the row value
NumberLessThanArgument = subset(Row_Sector_Month_Values, ROE2 < Row_Value2)
# Filter Number to values greater than or equal to the row value
NumberGreaterThanOrEqualArgument = subset(Row_Sector_Month_Values, ROE2 >= Row_Value2)
# RankLower = the count of Numbers less than row value, and is used later for
# interpolation of ranks
RankLower <- nrow(NumberLessThanArgument)
# NumberLower = the largest Number < row value, used for interpolation
NumberLower <- ifelse(RankLower==0, Row_Value2, max(NumberLessThanArgument))
# NumberUpper = the smallest Number >= row value, used for interpolation
NumberUpper = min(NumberGreaterThanOrEqualArgument)
# PercentRankArgumentRank = the rank of row value over the Number table, which is
# just RankLower + 1. This is the same rank as NumberUpper in the Number table itself.
PercentRankArgumentRank = RankLower + 1
# InterpolationFraction = fraction that row value is from NumberLower to NumberUpper
InterpolationFraction <- ifelse(RankLower==0, 0, (Row_Value2 - NumberLower)/(NumberUpper - NumberLower))
# Calculate the interpolated rank
RankInterpolated = max(1, RankLower + InterpolationFraction * (PercentRankArgumentRank - RankLower))
# Get the count of Numbers
NumberCount = nrow(Row_Sector_Month_Values)
# Final PercentRank is (RankInterpolated - 1)/(NumberCount - 1)
PercentRankOutput = (RankInterpolated - 1)/(NumberCount - 1)
# Append to create main dataframe
Row_Output <- subset(Row, select=c(Merge_Var3))
Row_Output$ROE2_percrank <- PercentRankOutput
ROE_Quintiles3 <- rbind(ROE_Quintiles3, Row_Output)
}
})
ROE_Quintiles3 <- subset(ROE_Quintiles3, Merge_Var3 != 'Temp')
Upvotes: 0
Views: 157
Reputation: 11738
As most of the questions "my R loop is slow", the problem is often about growing objects inside the loop. When I see ROE_Quintiles3 <- rbind(ROE_Quintiles3, Row_Output)
in your loop, I guess this is the problem here.
See https://privefl.github.io/blog/why-loops-are-slow-in-r/ to understand the problem I'm pointing you to and several solutions (tip: I would go for something similar to gen_list()
).
Upvotes: 2