Using for loop in R to remove outliers in various buckets

Question

I have approximately 114,000 individual IDs in my dataset. Each individual ID has a combination of numbers from three different variables and each ID has a dollar value associated with it. There are approximately 4,000 different combinations of numbers from the three variables. In order words, the 140,000 IDs (and their dollars) are placed into one of the 4,000 buckets.

What I want to accomplish using the 'for loop' in R, is to remove the top 2% of IDs (based on the IDs associative dollar amount) from each of the 4,000 buckets. I do not want to remove the top 2% of dollars from each bucket.

The rows in my dataset are as follows: ID (distinct ID), Event_ID, AL_ID, ACR_ID, ID_Dollars **Event_ID, AL_ID, and ACR_ID are the 'three variables' I referenced above.

I don’t know how to reference a proper data set for this question or I would have included it in my post. Hopefully my explanation is clear enough to answer the question I posed, if not, I apologize.

Thanks in advance for your help, it’s much appreciated.

Updated 11/17/2014:

I now have the following script:

# create random data similar to your requirements
ID <- seq(1,114000)
Event_ID <- sample(1:40, 114000,replace=TRUE)
AL_ID <- sample(1:10, 114000,replace=TRUE)
ACR_ID <- sample(1:10, 114000,replace=TRUE)
ID_Dollars <- sample(1:200000,114000,replace=TRUE)
df <- data.frame(ID,Event_ID,AL_ID,ACR_ID,ID_Dollars)

# create buckets
groupDf <- group_by(df,Event_ID,AL_ID,ACR_ID)
groupDfSum <- summarize(groupDf, ID_Dollars=sum(as.numeric(ID_Dollars))
groupDfSumHowManyIDShouldBeCut <- mutate(groupDfSum,numberToCut = ceiling(ID_Dollars*0.02))

# Here I am using 1 but you should use maximum value of numberToCut. 
IDs_ToBeCut<-filter(groupDf, rank(ID_Dollars,   ties.method="first")==max(groupDfSumHowManyIDShouldBeCut$numberToCut))

While the script is a step in the right direction, it still doesn't accomplish what I ultimately want to do with the data set. My goal is to remove the top 2% of episode_ID's (based on dollars) from each of the 4,000 buckets. Let me know if I should provide any additional information. As always, thanks for your help.

Using for loop in R to remove outliers in various buckets

Answers (1)

Related Questions