Reputation: 25
I have an HR dataframe containing info related to employees in an organization, e.g., salary, department, ID, etc.
I am trying to replace the outliers (salaries above USD 200,000) in the column "Salary_2018" for the "Sales" department with the mean of that column.
This is for a professional course I am following and I am given both the dataframe AND the code, which is:
library(readxl)
df<-read_excel("C:\\Media Mean Mode.xlsx")
df1<-df[df$department=="Sales",]
df2 = df1
df2[df2$salary_2018<200000,]<-mean(df2$salary_2018)
In the video I am studying, the instructor uses the very same dataframe with the very same code, and it works. However, when I run exactly the same thing, I get the following error:
Error: Assigned data `mean(df2$salary_2018)` must be compatible with existing data.
i Error occurred for column `department`.
x Can't convert <double> to <character>.
I would understand the error if I were trying to replace the information in the "department" column, as the data type is "character".
But considering that I am working on "salary_2018", which is "double", why does the error refer to "department"?
Do you have any idea why this is happening?
Thanks!
EDIT: As suggested by Peter, I added the structure of the dataframe here below.
> dput(head(df, 5))
structure(list(age = c(41, 49, 37, 33, 27), department = c("Sales",
"Research & Development", "Research & Development", "Research & Development",
"Research & Development"), employee_number = c(1, 2, 4, 5, 7),
gender = c("Female", "Male", "Male", "Female", "Male"), job_level = c(2,
2, 1, 1, 1), marital_status = c("Single", "Married", "Single",
"Married", "Married"), over_time = c("Yes", "No", "Yes",
"Yes", "No"), performance_rating = c(3, 4, 3, 3, 3), totalW_working_years = c(8,
10, 7, 8, 6), training_times_last_year = c(0, 3, 3, 3, 3),
years_since_last_promotion = c(0, 1, 0, 3, 2), years_with_curr_manager = c(5,
7, 0, 0, 2), monthly_income = c(5993, 5130, 2090, 2909, 3468
), salary_2017 = c(71916, 61560, 25080, 34908, 41616), salary_2018 = c(79826.76,
75718.8, 28842, 38747.88, 46609.92), year_of_joining = c(2012,
2008, 2018, 2010, 2016), last_role_change = c(2014, 2011,
2018, 2011, 2016), percent_hike = c(11, 23, 15, 11, 12)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 0
Views: 158
Reputation: 6954
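Even without the actual data, the problem is visible: your code tries to replace ALL columns with the mean salary in every row where the salary is below 200k (shouldn't it be above?). This is because you did not specify a column after the comma, and an empty column index means all columns. The assignment therefore also tries to write a number into the character column "department", which is the conversion the error message complains about. Note the difference in this code: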
# all columns
mtcars[1:4, ]
#>                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710     22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
# columns one, two and three
mtcars[1:4, 1:3]
#>                 mpg cyl disp
#> Mazda RX4      21.0   6  160
#> Mazda RX4 Wag  21.0   6  160
#> Datsun 710     22.8   4  108
#> Hornet 4 Drive 21.4   6  258
In your case, try:
df2[df2$salary_2018 > 200000, "salary_2018"] <- mean(df2$salary_2018, na.rm = TRUE)
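To see the fix in isolation, here is a minimal sketch on a toy data frame. The four rows and their salary values are made up for illustration and are not taken from your file; only the column names match your data.

```r
# Toy stand-in for the Sales subset (hypothetical values, not the real data)
df2 <- data.frame(
  department  = c("Sales", "Sales", "Sales", "Sales"),
  salary_2018 = c(50000, 60000, 250000, 70000),
  stringsAsFactors = FALSE
)

# Logical row index: TRUE only for salaries above the 200k threshold
is_outlier <- df2$salary_2018 > 200000

# Naming the column after the comma restricts the assignment to salary_2018,
# so the character column "department" is left alone
df2[is_outlier, "salary_2018"] <- mean(df2$salary_2018, na.rm = TRUE)

df2$salary_2018
#> [1]  50000  60000 107500  70000
```

Note that the mean here is computed with the outlier still included. If you would rather use the mean of the non-outlier salaries, something like mean(df2$salary_2018[!is_outlier]) should do it. Also, if salary_2018 contains missing values, the logical index will contain NA and the assignment is likely to fail; wrapping the condition in which() is one way around that.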
Upvotes: 1