Reputation: 11
The data frame I am working on contains many factors. Take the categorical variables from mtcars (cyl, vs, am, gear, carb).
head(mtcars[c("cyl","vs","am","gear","carb")])
                  cyl vs am gear carb
Mazda RX4           6  0  1    4    4
Mazda RX4 Wag       6  0  1    4    4
Datsun 710          4  1  1    4    1
Hornet 4 Drive      6  1  0    3    1
Hornet Sportabout   8  0  0    3    2
Valiant             6  1  0    3    1
Currently I have two nested for loops that find the levels occurring less than 10% of the time within a given factor and assign them to a new level name. I would like to assign those levels in each factor to a new level named guz. Is there an elegant way to do that?
The output would be a data frame in which, for every factor (assume the columns above are factors), the rows belonging to a level that occurs less than 10% of the time are assigned to a new level guz. Take level 2 in carb: it occurs only once in the rows above (that is more than 10 percent, but imagine it were less). That level, and every other level in the factor for which this is true, would be recoded into a new level named guz. The new carb column would then be 4,4,1,1,guz,1.
The output for a 50% threshold would be:
head(mtcars[c("cyl","vs","am","gear","carb")])
                  cyl vs am gear carb
Mazda RX4           6  0  1    4  guz
Mazda RX4 Wag       6  0  1    4  guz
Datsun 710        guz  1  1    4    1
Hornet 4 Drive      6  1  0    3    1
Hornet Sportabout guz  0  0    3  guz
Valiant             6  1  0    3    1
Upvotes: 1
Views: 219
Reputation: 94317
First, let's make the columns in mtcars into clear factors:
cols = c("vs","am","gear","cyl", "carb")
for(col in cols){mtcars[,col]=factor(paste0(col,mtcars[,col]))}
Now write a function that takes a factor and returns a factor with levels reclassified as you want. Make it flexible with the label and the threshold:
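thresh_factor = function(F, thresh=0.1, label="guz"){
  n = length(F)             # total number of observations
  t = table(F)              # count of observations per level
  under = t < (n*thresh)    # TRUE for levels rarer than the threshold
  levels(F)[under] = label  # rename those levels; equal names are merged
  F
}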
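To make the counting step concrete, here is a minimal sketch of the same three operations on a toy factor. The factor `f` and the 35% cutoff are made up for illustration and are not part of the question's data.

```r
# Hypothetical toy factor: "a" occurs 6 times, "b" 3 times, "c" once (n = 10)
f <- factor(c(rep("a", 6), rep("b", 3), "c"))

counts <- table(f)                    # a: 6, b: 3, c: 1
under  <- counts < length(f) * 0.35   # cutoff is 3.5 observations

# Assigning the same name to several levels merges them into one level
levels(f)[under] <- "guz"
levels(f)
table(f)
```

Here "b" and "c" both fall under the cutoff, so they end up sharing the single level guz while "a" is left alone.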
This can now be tested:
> thresh_factor(factor(1:20))
[1] guz guz guz guz guz guz guz guz guz guz guz guz guz guz guz guz guz guz guz
[20] guz
Levels: guz
They all become guz because each of 1:20 is unique, so every level accounts for only 5% of the values. More tests:
> thresh_factor(mtcars$carb)
[1] carb4 carb4 carb1 carb1 carb2 carb1 carb4 carb2 carb2 carb4 carb4 guz
[13] guz guz carb4 carb4 carb4 carb1 carb2 carb1 carb1 carb2 carb2 carb4
[25] carb2 carb1 carb2 carb2 carb4 guz guz carb2
Levels: carb1 carb2 guz carb4
The rare levels there (carb3, carb6 and carb8) have been replaced. Another test:
> thresh_factor(mtcars$cyl)
[1] cyl6 cyl6 cyl4 cyl6 cyl8 cyl6 cyl8 cyl4 cyl4 cyl6 cyl6 cyl8 cyl8 cyl8 cyl8
[16] cyl8 cyl8 cyl4 cyl4 cyl4 cyl4 cyl8 cyl8 cyl8 cyl8 cyl4 cyl4 cyl4 cyl8 cyl6
[31] cyl8 cyl4
Levels: cyl4 cyl6 cyl8
None of the levels are replaced there. Looks good. Now apply it over all the columns:
> for(col in cols){mtcars[,col]=thresh_factor(mtcars[,col])}
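If you prefer to avoid the explicit loop, the same step can probably be written with lapply. This is only a sketch: it repeats the function so the block runs on its own, works on a copy called `df` (a name I made up), and keeps the plain numeric level names rather than the pasted ones used above.

```r
# Same logic as thresh_factor above, repeated so this block is self-contained
thresh_factor <- function(F, thresh = 0.1, label = "guz") {
  under <- table(F) < length(F) * thresh
  levels(F)[under] <- label
  F
}

df   <- datasets::mtcars                      # work on a copy, not on mtcars itself
cols <- c("vs", "am", "gear", "cyl", "carb")

df[cols] <- lapply(df[cols], factor)          # convert the columns to factors
df[cols] <- lapply(df[cols], thresh_factor)   # recode rare levels in each column

levels(df$carb)
```

Assigning to `df[cols]` replaces only those columns, so the rest of the data frame is untouched.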
Just to test again using your sample output, with numeric factor levels and a 50% threshold:
> rm(mtcars) # start fresh
> mtcars=head(mtcars) # first 6 rows for test
> for(col in cols){mtcars[,col]=factor(mtcars[,col])} # convert columns to factors
Now run my code:
> for(col in cols){mtcars[,col]=thresh_factor(mtcars[,col],thresh=0.5)}
> head(mtcars[c("cyl","vs","am","gear","carb")])
                  cyl vs am gear carb
Mazda RX4           6  0  1    4  guz
Mazda RX4 Wag       6  0  1    4  guz
Datsun 710        guz  1  1    4    1
Hornet 4 Drive      6  1  0    3    1
Hornet Sportabout guz  0  0    3  guz
Valiant             6  1  0    3    1
This matches your expected output.
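If you want to check the match programmatically rather than by eye, something like the following should work. It is a sketch under the same assumptions as the test above (first six rows, 50% threshold); `df` is a hypothetical name, and the expected vectors are copied from the sample output in the question.

```r
# Same logic as thresh_factor above, repeated so this block is self-contained
thresh_factor <- function(F, thresh = 0.1, label = "guz") {
  under <- table(F) < length(F) * thresh
  levels(F)[under] <- label
  F
}

df   <- head(datasets::mtcars)                # first 6 rows, as in the question
cols <- c("vs", "am", "gear", "cyl", "carb")
df[cols] <- lapply(df[cols], factor)
df[cols] <- lapply(df[cols], thresh_factor, thresh = 0.5)

# Compare against the columns shown in the question's expected output
stopifnot(
  identical(as.character(df$cyl),  c("6", "6", "guz", "6", "guz", "6")),
  identical(as.character(df$carb), c("guz", "guz", "1", "1", "guz", "1")),
  !("guz" %in% levels(df$vs))
)
```

stopifnot returns silently when every condition holds and raises an error otherwise.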
Upvotes: 2