Reputation: 135
I have a data frame that contains missing data (NA). I want to find the columns that have only one non-missing value and then fill those column with that value. For instance, if my original data set is like this:
df = data.frame(A=c(1,2,NA,NA,1), B=c(NA,NA,3,3,3), C=c(2,5,6,3,6), D = c(NA,1, NA,1,1))
A B C D
1 1 NA 2 NA
2 2 NA 5 1
3 NA 3 6 NA
4 NA 3 3 1
5 1 3 6 1
I want to attain:
A B C D
1 1 3 2 1
2 2 3 5 1
3 NA 3 6 1
4 NA 3 3 1
5 1 3 6 1
I first created this function to do what I want: (please let me know if you have a better option to this)
fill_NAs <- function(x){
x %>% fill(., .direction = "up") %>%
fill(., .direction = "down")
}
And then I tried using mutate_at or mutate_if functions but I couldn't manage to make it work. Some of my attempts were the following:
1)
df= df %>% mutate_if ( ~ length(unique(na.omit(.)))==1, ~ fill_NAs(.))
I got this error:
Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"
.
2)
df_PMM_imp = df_PMM_imp %>% mutate_at(.,names(select_if(.,length(unique(na.omit(.)))==1)), ~ fill_NAs(.))
I got this error:
Error in tbl_if_vars(.tbl, .predicate, caller_env(), .include_group_vars = TRUE) : length(.p) == length(tibble_vars) is not TRUE
In general I can't seem to understand the logic of mutate_if or mutate_at, I always run into such problems. So my questions are:
1) What is the correct syntax which achieves what I want? 2) Why do I get these errors above, what am I doing wrong?
Many thanks in advance.
Upvotes: 2
Views: 585
Reputation: 14764
A dplyr
way:
library(dplyr)
df %>%
mutate_all(~ case_when(
n_distinct(.[na.omit(.)]) == 1 ~ first(na.omit(.)),
TRUE ~ .
)
)
Output:
A B C D
1 1 3 2 1
2 2 3 5 1
3 NA 3 6 1
4 NA 3 3 1
5 1 3 6 1
Upvotes: 0
Reputation: 26343
You could iterate over your columns, check the length of the unique elements that are not NA
and replace the NA
s in a column if it contains only a single unique element.
df[] <- lapply(df, function(x) {
y <- unique(na.omit(x))
if(length(y) == 1) {
x <- y
} else x
})
Result
df
A B C D
1 1 3 2 1
2 2 3 5 1
3 NA 3 6 1
4 NA 3 3 1
5 1 3 6 1
Upvotes: 1
Reputation: 135
Hi @Elif Cansu Akoğuz,
You can achieve your goal fairly easy by using a for()
-loop to iterate over your dataframe and than check per column if it meets your requirements:
for (i in 1:ncol(df)) {
if (any(is.na(unique(df[i]))) == TRUE & nrow(unique(df[i])) == 2) {
...
}
}
With the statement any(is.na(unique(df[i])))
you check if there is a NA
somewhere in the column, and with nrow(unique(df[i])) == 2
you check if the column only consists of two unique values. So if a column meets both statements, you know it consists of 2 values, of which one in NA
.
Next, you want to replace all values in the column by the value that is not NA
. To do this you can use na.omit()
to throw out the NA
value, and than repeat the remaining value for however long the dataframe is.
A problem is that this will return a list-format, so you will have to use the unlist()
-function to solve this.
The entire loop will end up looking like this:
for (i in 1:ncol(df)) {
if (any(is.na(unique(df[i]))) == TRUE & nrow(unique(df[i])) == 2) {
df[i] <- unlist(rep(na.omit(unique(df[i])), nrow(df)))
}
}
Good luck!
Upvotes: 0