Issues when replacing outliers with mean in R

Question

I have an HR dataframe containing info related to employees in an organization, e.g., salary, department, ID, etc.

What I am trying to do is to replace the outliers (USD>200000) in the column "Salary_2018" for the "Sales" department with the mean of the column itself.

This is for a professional course I am following, and I was given both the dataframe and the code, which is:

library(readxl)
df<-read_excel("C:\\Media Mean Mode.xlsx")
df1<-df[df$department=="Sales",]
df2 = df1
df2[df2$salary_2018<200000,]<-mean(df2$salary_2018)

In the video I am studying from, the instructor uses the very same dataframe with the very same code, and it works. However, when I run exactly the same code, I get the following error:

Error: Assigned data `mean(df2$salary_2018)` must be compatible with existing data.
i Error occurred for column `department`.
x Can't convert <double> to <character>.

I would understand the error if I were trying to replace the information in the "department" column, as the data type is "character".

But considering that I am working on "salary_2018", which is "double", why does the error refer to "department"?

Do you have any idea why this is happening?

Thanks!

EDIT: As suggested by Peter, I have added the structure of the dataframe below.

> dput(head(df, 5))

structure(list(age = c(41, 49, 37, 33, 27), department = c("Sales", 
"Research & Development", "Research & Development", "Research & Development", 
"Research & Development"), employee_number = c(1, 2, 4, 5, 7), 
    gender = c("Female", "Male", "Male", "Female", "Male"), job_level = c(2, 
    2, 1, 1, 1), marital_status = c("Single", "Married", "Single", 
    "Married", "Married"), over_time = c("Yes", "No", "Yes", 
    "Yes", "No"), performance_rating = c(3, 4, 3, 3, 3), totalW_working_years = c(8, 
    10, 7, 8, 6), training_times_last_year = c(0, 3, 3, 3, 3), 
    years_since_last_promotion = c(0, 1, 0, 3, 2), years_with_curr_manager = c(5, 
    7, 0, 0, 2), monthly_income = c(5993, 5130, 2090, 2909, 3468
    ), salary_2017 = c(71916, 61560, 25080, 34908, 41616), salary_2018 = c(79826.76, 
    75718.8, 28842, 38747.88, 46609.92), year_of_joining = c(2012, 
    2008, 2018, 2010, 2016), last_role_change = c(2014, 2011, 
    2018, 2011, 2016), percent_hike = c(11, 23, 15, 11, 12)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

mnist · Accepted Answer

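Even without the actual data, the problem is visible: your code tries to replace ALL columns with the mean of salary_2018 in every row where the salary is below 200k (shouldn't it be above?). This happens because you did not specify a column after the comma, and an empty slot there means all columns. That is also why the error mentions `department`: it is presumably the first column that cannot hold a double. Note the difference in this code: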

# all columns
mtcars[1:4, ]
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

# columns one, two and three
mtcars[1:4, 1:3]
#>                 mpg cyl disp
#> Mazda RX4      21.0   6  160
#> Mazda RX4 Wag  21.0   6  160
#> Datsun 710     22.8   4  108
#> Hornet 4 Drive 21.4   6  258
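
The same rule applies when assigning, not only when extracting. Here is a minimal sketch on a made-up two-column data frame (the names `demo`, `name` and `value` are hypothetical, not from your data). With a plain base-R data.frame, leaving the column slot empty writes the number into every column of the matching rows and silently coerces it to text in the character column. Your `df` is a tibble (see the class in your dput output), which refuses that conversion and raises the error you saw instead.

```r
# Hypothetical toy data: one character column, one numeric column
demo <- data.frame(
  name  = c("a", "b"),
  value = c(1, 10),
  stringsAsFactors = FALSE
)

# Empty column slot after the comma: every column of the
# matching row is overwritten, including the character one
demo[demo$value > 5, ] <- 99

# Inspect the result: both columns of row 2 were touched
print(demo)
str(demo)
```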

In your case, try:

df2[df2$salary_2018 > 200000, "salary_2018"] <- mean(df2$salary_2018, na.rm = TRUE)
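
If you want to check the behaviour without the Excel file, here is a small self-contained sketch in base R. The data frame `toy` and its values are invented for illustration; only the column names mirror yours. It computes the mean once, finds the outlier rows, and names the column explicitly so nothing else is overwritten.

```r
# Toy data (hypothetical values, not the asker's dataset)
toy <- data.frame(
  department  = c("Sales", "Sales", "Sales", "Sales"),
  salary_2018 = c(50000, 60000, 70000, 300000),
  stringsAsFactors = FALSE
)

# Compute the replacement value once, before modifying the column
salary_mean <- mean(toy$salary_2018, na.rm = TRUE)

# Row positions of the outliers; which() ignores NA comparisons
outlier_rows <- which(toy$salary_2018 > 200000)

# Name the column after the comma so only salary_2018 is overwritten
toy[outlier_rows, "salary_2018"] <- salary_mean

print(toy)
```

Note that the mean here is taken over the whole column, outliers included, which is what the question asks for. If you would rather use the mean of the non-outlier values only, compute it on the filtered vector before assigning.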
