Elif Cansu Akoğuz
Elif Cansu Akoğuz

Reputation: 135

R -Finding the columns with only one non-missing value and filling its missing values with that unique non-missing value

I have a data frame that contains missing data (NA). I want to find the columns that have only one non-missing value and then fill those column with that value. For instance, if my original data set is like this:

  df = data.frame(A=c(1,2,NA,NA,1), B=c(NA,NA,3,3,3), C=c(2,5,6,3,6), D = c(NA,1, NA,1,1))

   A  B C  D
1  1 NA 2 NA
2  2 NA 5  1
3 NA  3 6 NA
4 NA  3 3  1
5  1  3 6  1 

I want to attain:

   A  B C  D
1  1  3 2  1
2  2  3 5  1
3 NA  3 6  1
4 NA  3 3  1
5  1  3 6  1 

I first created this function to do what I want: (please let me know if you have a better option to this)

fill_NAs <- function(x){
  x %>% fill(., .direction = "up") %>%
        fill(., .direction = "down")
}

And then I tried using mutate_at or mutate_if functions but I couldn't manage to make it work. Some of my attempts were the following:

1)

df= df %>% mutate_if ( ~ length(unique(na.omit(.)))==1, ~ fill_NAs(.))

I got this error:

Error in mutate_impl(.data, dots) : Evaluation error: no applicable method for 'fill_' applied to an object of class "c('double', 'numeric')"

.

2)

df_PMM_imp = df_PMM_imp %>% mutate_at(.,names(select_if(.,length(unique(na.omit(.)))==1)), ~ fill_NAs(.))

I got this error:

Error in tbl_if_vars(.tbl, .predicate, caller_env(), .include_group_vars = TRUE) : length(.p) == length(tibble_vars) is not TRUE

In general I can't seem to understand the logic of mutate_if or mutate_at, I always run into such problems. So my questions are:

1) What is the correct syntax which achieves what I want? 2) Why do I get these errors above, what am I doing wrong?

Many thanks in advance.

Upvotes: 2

Views: 585

Answers (3)

arg0naut91
arg0naut91

Reputation: 14764

A dplyr way:

library(dplyr)

df %>% 
  mutate_all(~ case_when(
    n_distinct(.[na.omit(.)]) == 1 ~ first(na.omit(.)),
    TRUE ~ .
    )
  )

Output:

   A B C D
1  1 3 2 1
2  2 3 5 1
3 NA 3 6 1
4 NA 3 3 1
5  1 3 6 1

Upvotes: 0

markus
markus

Reputation: 26343

You could iterate over your columns, check the length of the unique elements that are not NA and replace the NAs in a column if it contains only a single unique element.

df[] <- lapply(df, function(x) {
  y <- unique(na.omit(x))
  if(length(y) == 1) {
    x <- y
  } else x
})

Result

df
   A B C D
1  1 3 2 1
2  2 3 5 1
3 NA 3 6 1
4 NA 3 3 1
5  1 3 6 1

Upvotes: 1

JNab
JNab

Reputation: 135

Hi @Elif Cansu Akoğuz,

You can achieve your goal fairly easy by using a for()-loop to iterate over your dataframe and than check per column if it meets your requirements:

for (i in 1:ncol(df)) {
    if (any(is.na(unique(df[i]))) == TRUE & nrow(unique(df[i])) == 2) {
        ...
    }
}

With the statement any(is.na(unique(df[i]))) you check if there is a NA somewhere in the column, and with nrow(unique(df[i])) == 2 you check if the column only consists of two unique values. So if a column meets both statements, you know it consists of 2 values, of which one in NA.

Next, you want to replace all values in the column by the value that is not NA. To do this you can use na.omit() to throw out the NA value, and than repeat the remaining value for however long the dataframe is. A problem is that this will return a list-format, so you will have to use the unlist()-function to solve this.

The entire loop will end up looking like this:

for (i in 1:ncol(df)) {
    if (any(is.na(unique(df[i]))) == TRUE & nrow(unique(df[i])) == 2) {
        df[i] <- unlist(rep(na.omit(unique(df[i])), nrow(df)))
    }
}

Good luck!

Upvotes: 0

Related Questions