user2762934


Replace value in a column based on a Frequency Count using R

I have a dataset with multiple columns. Many of these columns are factors with more than 32 levels, so to run a Random Forest (for example), I want to replace the values in each such column based on their frequency counts.

One of the columns reads like this:

$ country: Factor w/ 92 levels "China","India","USA",..: 30 39 39 20 89 30 16 21 30 30 ...

What I would like to do is retain only the top N countries (where N is a value between 5 and 20) and replace the remaining values with "Other". I know how to calculate the frequency of the values using the table function, but I can't find a way to replace values on the basis of such a rule. How can this be done?


Answers (1)

thelatemail


Some example data:

set.seed(1)
x <- factor(sample(1:5,100,prob=c(1,3,4,2,5),replace=TRUE))
table(x)
# 1  2  3  4  5 
# 4 26 30 13 27 

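Replace all the levels other than the top 3 (levels 2, 3 and 5) with "Other":

# rank(-table(x)) ranks levels by descending count, so rank > 3 selects every level outside the top 3
levels(x)[rank(-table(x), ties.method = "first") > 3] <- "Other"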

table(x)
#Other     2     3     5 
#   17    26    30    27
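
To apply the same idea with an arbitrary N, as the question asks, the step can be wrapped in a small function. This is only a sketch in base R: the name lump_top_n and its arguments are made up for illustration, and the example data is built by hand so the counts don't depend on the random number generator.

```r
# Hypothetical helper (not part of the original answer): keep the n most
# frequent levels of a factor and merge every other level into `other`.
lump_top_n <- function(f, n, other = "Other") {
  f <- as.factor(f)
  counts <- table(f)
  # Names of the n most frequent levels; head() copes with n > nlevels(f).
  top <- head(names(sort(counts, decreasing = TRUE)), n)
  # Assigning the same name to several levels merges them into one level.
  levels(f)[!(levels(f) %in% top)] <- other
  f
}

# Hand-built example: China 5, India 4, USA 3, Peru 2, Fiji 1.
country <- factor(rep(c("China", "India", "USA", "Peru", "Fiji"),
                      times = c(5, 4, 3, 2, 1)))
lumped <- lump_top_n(country, n = 3)
table(lumped)
```

For a data frame column this would be something like df$country <- lump_top_n(df$country, n = 10), where df is assumed to be your data. If you already use the forcats package, fct_lump_n() does much the same job.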

