Reputation: 414
Please see my code below:
# functions to get percentile threshold, and assign new values to outliers
get_low_perc <- function(var_name) {
  return(quantile(var_name, c(0.01)))
}
get_hi_perc <- function(var_name) {
  return(quantile(var_name, c(0.99)))
}
round_up <- function(target_var, flag_var, floor) {
  target_var <- as.numeric(ifelse(flag_var == 1, floor, target_var))
  return(as.integer(target_var))
}
round_down <- function(target_var, flag_var, ceiling) {
  target_var <- as.numeric(ifelse(flag_var == 1, ceiling, target_var))
  return(as.integer(target_var))
}
# try putting it all together
no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
  lo_perc <- get_low_perc(df_col_name)
  hi_perc <- get_hi_perc(df_col_name)
  df$df_col_flagH <- as.factor(ifelse(df_col_name < lo_perc, 1, 0))
  df$df_col_flagL <- as.factor(ifelse(df_col_name > hi_perc, 1, 0))
  df_col_name <- round_up(df_col_name, df_col_flagL, lo_perc)
  df_col_name <- round_down(df_col_name, df_col_flagH, hi_perc)
  # names(df)[names(df)=='df_col_flagH'] <-
  # boxplot(df_col_name)
  return(df)
}
I have created five custom functions. The first two return the 1st and 99th percentiles of a given variable. The next two round flagged values up or down to those percentile thresholds. The last function tries to put these together and return a new dataframe containing the original columns, the updated column, and two new columns flagging values below the 1st percentile and above the 99th percentile. I have produced a mock dataframe below, since I can't share my actual data here.
df2 = data.frame(col = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77),
col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102))
Ideally, after I call the function using the command "no_way(df2, df2$col1, df2$new_col1, df2$new_col2)", I want an output dataframe looking like:
df2 = data.frame(col = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77), # updated with appropriate values
col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102),
new_col1 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
new_col2 = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0))
Here new_col1 and new_col2 are column names given by the user when calling the function. I am currently getting the dataframe as expected, except that the new columns keep the names of the function's parameters, as in:
df2 = data.frame(col = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77), # updated with appropriate values
col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102),
df_col_flagH = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0),
df_col_flagL = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0))
I would not mind renaming the columns afterwards, but I will be using this function on 17 columns, so that wouldn't be optimal. Please help.
Upvotes: 0
Views: 292
Reputation: 389235
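You should pass the column names as strings and index with [[ ]] instead of $. The $ operator does not evaluate its argument, so df$df_col_flagH always creates a column literally named df_col_flagH. Also, ifelse(condition, 1, 0) can be simplified to as.integer(condition).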
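A minimal sketch of the difference between $ and [[ ]] when a column name is held in a variable. The data frame toy and the variable new_name are made up for illustration and are not part of the question's data:

```r
# Toy data; `toy` and `new_name` are hypothetical names used only in this sketch
toy <- data.frame(x = c(1, 5, 10))
new_name <- "flag"

# `$` does not evaluate `new_name`, so the column is literally called "new_name"
toy$new_name <- as.integer(toy$x > 4)

# `[[` evaluates `new_name`, so the column is called "flag"
toy[[new_name]] <- as.integer(toy$x > 4)

names(toy)
# [1] "x"        "new_name" "flag"
```

Both new columns hold the same 0/1 values; only their names differ.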
no_way <- function(df, df_col_name, df_col_flagH, df_col_flagL) {
  lo_perc <- get_low_perc(df[[df_col_name]])
  hi_perc <- get_hi_perc(df[[df_col_name]])
  # As in the question, df_col_flagH marks values below the 1st percentile
  # and df_col_flagL marks values above the 99th percentile
  df[[df_col_flagH]] <- as.factor(as.integer(df[[df_col_name]] < lo_perc))
  df[[df_col_flagL]] <- as.factor(as.integer(df[[df_col_name]] > hi_perc))
  # Pass the flag columns themselves (not their names) to the helpers:
  # low values are raised to lo_perc, high values are lowered to hi_perc
  df[[df_col_name]] <- round_up(df[[df_col_name]], df[[df_col_flagH]], lo_perc)
  df[[df_col_name]] <- round_down(df[[df_col_name]], df[[df_col_flagL]], hi_perc)
  return(df)
}
df2 <- no_way(df2, "col1", "new_col1", "new_col2")
df2
#   col col1 col2 new_col1 new_col2
#1    1    9    8        0        0
#2    3    6    2        0        0
#3    4    8    8        0        0
#4    5    3    4        1        0
#5    8    4   87        0        0
#6    7    5   66        0        0
#7   67    8   54        0        0
#8  744    7   99        0        0
#9  876   67   77        0        0
#10   8  665   77        0        1
#11   8   87   88        0        0
#12  54   33   67        0        0
#13   9   77  102        0        0
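Since the question mentions running this on 17 columns, one way to do that is a loop over column names that builds the flag names with paste0(). The sketch below is self-contained and makes several assumptions: the helper flag_and_cap and the "_low"/"_high" suffixes are hypothetical, it caps values with pmin()/pmax() instead of the question's round_up()/round_down(), and it keeps the capped values as doubles and the flags as integers rather than factors:

```r
# Hypothetical helper: flags values outside the 1st/99th percentiles of one
# column and caps them at those thresholds. All names here are illustrative.
flag_and_cap <- function(df, col, low_flag, high_flag) {
  x  <- df[[col]]
  lo <- quantile(x, 0.01, names = FALSE)
  hi <- quantile(x, 0.99, names = FALSE)
  df[[low_flag]]  <- as.integer(x < lo)   # 1 if below the 1st percentile
  df[[high_flag]] <- as.integer(x > hi)   # 1 if above the 99th percentile
  df[[col]] <- pmin(pmax(x, lo), hi)      # cap both tails at the thresholds
  df
}

# Mock data from the question
df2 <- data.frame(col  = c(1, 3, 4, 5, 8, 7, 67, 744, 876, 8, 8, 54, 9),
                  col1 = c(9, 6, 8, 3, 4, 5, 8, 7, 67, 744, 87, 33, 77),
                  col2 = c(8, 2, 8, 4, 87, 66, 54, 99, 77, 77, 88, 67, 102))

# Apply the helper to each column of interest, deriving the flag names
for (col in c("col1", "col2")) {
  df2 <- flag_and_cap(df2, col, paste0(col, "_low"), paste0(col, "_high"))
}

names(df2)
# [1] "col"       "col1"      "col2"      "col1_low"  "col1_high" "col2_low"  "col2_high"
```

Extending the character vector in the for loop to all 17 column names would avoid renaming anything afterwards.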
Upvotes: 2