YimYames
YimYames

Reputation: 99

Using for loop in R to remove outliers in various buckets

I have approximately 114,000 individual IDs in my dataset. Each individual ID has a combination of numbers from three different variables and each ID has a dollar value associated with it. There are approximately 4,000 different combinations of numbers from the three variables. In order words, the 140,000 IDs (and their dollars) are placed into one of the 4,000 buckets.

What I want to accomplish using the 'for loop' in R, is to remove the top 2% of IDs (based on the IDs associative dollar amount) from each of the 4,000 buckets. I do not want to remove the top 2% of dollars from each bucket.

The rows in my dataset are as follows: ID (distinct ID), Event_ID, AL_ID, ACR_ID, ID_Dollars **Event_ID, AL_ID, and ACR_ID are the 'three variables' I referenced above.

I don’t know how to reference a proper data set for this question or I would have included it in my post. Hopefully my explanation is clear enough to answer the question I posed, if not, I apologize.

Thanks in advance for your help, it’s much appreciated.

Updated 11/17/2014:

I now have the following script:

# create random data similar to your requirements
ID <- seq(1,114000)
Event_ID <- sample(1:40, 114000,replace=TRUE)
AL_ID <- sample(1:10, 114000,replace=TRUE)
ACR_ID <- sample(1:10, 114000,replace=TRUE)
ID_Dollars <- sample(1:200000,114000,replace=TRUE)
df <- data.frame(ID,Event_ID,AL_ID,ACR_ID,ID_Dollars)

# create buckets
groupDf <- group_by(df,Event_ID,AL_ID,ACR_ID)
groupDfSum <- summarize(groupDf, ID_Dollars=sum(as.numeric(ID_Dollars))
groupDfSumHowManyIDShouldBeCut <- mutate(groupDfSum,numberToCut = ceiling(ID_Dollars*0.02))

# Here I am using 1 but you should use maximum value of numberToCut. 
IDs_ToBeCut<-filter(groupDf, rank(ID_Dollars,   ties.method="first")==max(groupDfSumHowManyIDShouldBeCut$numberToCut))

While the script is a step in the right direction, it still doesn't accomplish what I ultimately want to do with the data set. My goal is to remove the top 2% of episode_ID's (based on dollars) from each of the 4,000 buckets. Let me know if I should provide any additional information. As always, thanks for your help.

Upvotes: 0

Views: 399

Answers (1)

Atilla Ozgur
Atilla Ozgur

Reputation: 14721

I wrote following solution using dplyr, as I understand your question. I may be wrong feel free to comment.

library(dplyr)

# create random data similar to your requirements
ID <- seq(1,114000)
Event_ID <- sample(1:40, 114000,replace=TRUE)
AL_ID <- sample(1:10, 114000,replace=TRUE)
ACR_ID <- sample(1:10, 114000,replace=TRUE)
ID_Dollars <- sample(1:200000,114000,replace=TRUE)
df <- data.frame(ID,Event_ID,AL_ID,ACR_ID,ID_Dollars)

# create buckets
groupDf <- group_by(df,Event_ID,AL_ID,ACR_ID)
groupDfSum <- summarize(groupDf,CountOfGroup=n())
groupDfSumHowManyIDShouldBeCut <- mutate(groupDfSum,numberToCut = ceiling(CountOfGroup*0.02))

# Here I am using 1 but you should use maximum value of numberToCut. 
IDs_ToBeCut<-filter(groupDf, rank(ID_Dollars, ties.method="first")==1)


# After finding suitable ID values. You should for loop IDs_ToBeCut, control if that ID is lesser than numberToCut, and remove that ID value from df.

Upvotes: 1

Related Questions