Shaxi Liver

Reputation: 1120

Replace values that are far from the row mean with NA

I would like to take the mean of each row of my data and find out how far each value in the row is from that mean. If the relative difference is greater than 50%, the value should be replaced with NA.

That's the data:

structure(list(Name = structure(c(18L, 19L, 5L, 13L, 14L, 31L
), .Label = c("AMC Javelin", "Cadillac Fleetwood", "Camaro Z28", 
"Chrysler Imperial", "Datsun 710", "Dodge Challenger", "Duster 360", 
"Ferrari Dino", "Fiat 128", "Fiat X1-9", "Ford Pantera L", "Honda Civic", 
"Hornet 4 Drive", "Hornet Sportabout", "Lincoln Continental", 
"Lotus Europa", "Maserati Bora", "Mazda RX4", "Mazda RX4 Wag", 
"Merc 230", "Merc 240D", "Merc 280", "Merc 280C", "Merc 450SE", 
"Merc 450SL", "Merc 450SLC", "Pontiac Firebird", "Porsche 914-2", 
"Toyota Corolla", "Toyota Corona", "Valiant", "Volvo 142E"), class = "factor"), 
    mpg_1 = c(125, 133, 143, 141, 134, 238), cyl_1 = c(114, 153, 
    112, 136, 128, 155), disp_1 = c(113, 143, 144, 131, 431, 
    331), hp_1 = c(332, 221, 113, 331, 134, 151)), .Names = c("Name", 
"mpg_1", "cyl_1", "disp_1", "hp_1"), row.names = c(NA, 6L), class = "data.frame")

and that's the desired output:

               Name mpg_1 cyl_1 disp_1 hp_1
1         Mazda RX4   125   114    113   NA
2     Mazda RX4 Wag   133   153    143  221
3        Datsun 710   143   112    144  113
4    Hornet 4 Drive   141   136    131   NA
5 Hornet Sportabout   134   128     NA  134
6           Valiant   238   155    331  151

There are two conditions as well.

  1. At most one outlying value per row may be replaced with NA. With a 50% cutoff it is unlikely that two values in the same row would qualify, because the mean would shift considerably, but see the second condition.
  2. It would be great if the cutoff percentage were easy to modify. I may go lower than 50%.

Do you have any idea how to do this efficiently? It looks doable with a loop, but maybe there is a more efficient way?

Upvotes: 0

Views: 56

Answers (1)

Sotos

Reputation: 51592

From a statistical point of view, as @Roland mentions in the comments, this is not advised. But if you absolutely have to do it, then:

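# Set the first value of x that lies more than fraction n above mean(x) to NA
fun1 <- function(x, n){
  m <- mean(x)
  # index of the first value whose relative excess over the mean is > n (NA if none)
  idx <- which((x - m)/m > n)[1]
  x[idx] <- NA  # an NA index selects nothing, so x is returned unchanged
  return(x)
}

# apply() works row by row and returns the rows as columns, hence t();
# change 0.5 to use a different cutoff
df1[-1] <- t(apply(df1[-1], 1, fun1, 0.5))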

df1
#               Name mpg_1 cyl_1 disp_1 hp_1
#1         Mazda RX4   125   114    113   NA
#2     Mazda RX4 Wag   133   153    143  221
#3        Datsun 710   143   112    144  113
#4    Hornet 4 Drive   141   136    131   NA
#5 Hornet Sportabout   134   128     NA  134
#6           Valiant   238   155     NA  151
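
If "how far from the mean" should also cover values below the mean, and if the value to drop should be the one furthest from the mean rather than the first one past the cutoff, a vectorised variant along these lines might work. This is only a sketch: `mask_row_outlier` and its arguments `cols` and `cutoff` are names made up for illustration, and it assumes the selected columns are numeric, contain no NA to begin with, and that no row mean is zero.

```r
# Hypothetical helper (not part of the answer above): per row, set to NA the
# single value that deviates most from the row mean, but only if its relative
# deviation is larger than `cutoff`.
mask_row_outlier <- function(df, cols, cutoff = 0.5) {
  m  <- as.matrix(df[cols])
  mu <- rowMeans(m)
  # relative distance of every value from its row mean (mu is recycled per row)
  rel_dev <- abs(m - mu) / abs(mu)
  # column index of the largest deviation in each row
  worst <- max.col(rel_dev, ties.method = "first")
  idx   <- cbind(seq_len(nrow(m)), worst)
  # keep only the rows whose largest deviation exceeds the cutoff
  hit <- rel_dev[idx] > cutoff
  m[idx[hit, , drop = FALSE]] <- NA
  df[cols] <- m
  df
}

# The question's data, with Name as a plain character column
df1 <- data.frame(
  Name   = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
             "Hornet 4 Drive", "Hornet Sportabout", "Valiant"),
  mpg_1  = c(125, 133, 143, 141, 134, 238),
  cyl_1  = c(114, 153, 112, 136, 128, 155),
  disp_1 = c(113, 143, 144, 131, 431, 331),
  hp_1   = c(332, 221, 113, 331, 134, 151)
)

res   <- mask_row_outlier(df1, cols = names(df1)[-1], cutoff = 0.5)
res30 <- mask_row_outlier(df1, cols = names(df1)[-1], cutoff = 0.3)
```

On this data `res` has NA in the same four cells as the `fun1` result above, because every value past the 50% cutoff happens to lie above its row mean. Lowering the cutoff is just a matter of changing the `cutoff` argument, and at most one value per row is ever replaced.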

Upvotes: 3
