Ashok K Harnal

Reputation: 1221

Fill in NA values in tibble without converting it to data.frame

One normal way to fill in NA values in a data frame, loan, is as follows:

for (i in 1:ncol(loan)) {
  if (is.character(loan[, i])) {
    loan[is.na(loan[, i]), i] <- "missing"
  }
  if (is.numeric(loan[, i])) {
    loan[is.na(loan[, i]), i] <- 9999
  }
}

But if loan is a tibble, the above method does not work, because is.character(loan[,i]) and is.numeric(loan[,i]) are both always FALSE. The class of loan is shown below:

> class(loan)
[1] "tbl_df"     "tbl"        "data.frame"

To use the above for-loop for filling in missing values, I have to first convert 'loan' to a data frame with as.data.frame() and then run the loop.

Is it possible to directly manipulate a tibble without first converting it to a data.frame to fill in missing values?

Upvotes: 2

Views: 1496

Answers (1)

akrun

Reputation: 887118

We can use tidyverse syntax to do this:

library(tidyverse) 
loan %>% 
    mutate_if(is.character, funs(replace(., is.na(.), "missing"))) %>% 
    mutate_if(is.numeric, funs(replace(., is.na(.), 9999)))
# A tibble: 20 × 3
#      Col1  Col2    Col3
#     <chr> <dbl>   <chr>
#1        a  9999       A
#2        a     2       A
#3        d     3       A
#4        c  9999 missing
#5        c     1 missing
#6        e     3 missing
#7        a  9999       A
#8        d     2       A
#9        d     3       A
#10       a  9999       A
#11       c     1       A
#12       b     1       C
#13       d     1       A
#14       d  9999       B
#15       a     4       B
#16       e     1       C
#17       a     3       A
#18 missing     3       A
#19       c     3 missing
#20 missing     4 missing
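In dplyr 0.8.0 and later, funs() is deprecated, and from dplyr 1.0.0 the mutate_if() family is superseded by across(). The following is a minimal sketch of the same replacement in that newer style, assuming dplyr >= 1.0.0 is installed. The small tibble loan_demo is a hypothetical stand-in for loan and is not part of the original question.

```r
library(dplyr)
library(tibble)

# Hypothetical three-row tibble standing in for `loan`
loan_demo <- tibble(
  Col1 = c("a", NA, "c"),
  Col2 = c(1, NA, 3),
  Col3 = c(NA, "B", "C")
)

# Replace NA in character columns with "missing" and in numeric columns with 9999
loan_filled <- loan_demo %>%
  mutate(
    across(where(is.character), ~ replace(.x, is.na(.x), "missing")),
    across(where(is.numeric), ~ replace(.x, is.na(.x), 9999))
  )

loan_filled
```

The result is still a tibble, so no conversion to data.frame is involved at any point.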

Because the dataset is a tibble, extracting a single column with [ returns a one-column tibble rather than a vector. To get the column as a vector, we need [[ instead:

for (i in 1:ncol(loan)) {
  # [[ extracts the column as a plain vector, so the type checks work
  if (is.character(loan[[i]])) {
    loan[is.na(loan[[i]]), i] <- "missing"
  }
  if (is.numeric(loan[[i]])) {
    loan[is.na(loan[[i]]), i] <- 9999
  }
}

To understand the problem, we just need to look at the output of each kind of extraction:

head(is.na(loan[,1]))
#      Col1
#[1,] FALSE
#[2,] FALSE
#[3,] FALSE
#[4,] FALSE
#[5,] FALSE
#[6,] FALSE

head(is.na(loan[[1]]))
#[1] FALSE FALSE FALSE FALSE FALSE FALSE

In the for loop, the first form supplies the row index as a logical matrix with one column, whereas the second supplies a logical vector, and that is what makes the difference.
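The difference between the two extraction forms can also be checked directly on the classes. This is a small sketch assuming the tibble package is installed; df and tbl are hypothetical names used only for this illustration.

```r
library(tibble)

# The same data as a plain data.frame and as a tibble (hypothetical example)
df  <- data.frame(x = c("a", NA), stringsAsFactors = FALSE)
tbl <- as_tibble(df)

class(df[, 1])           # "character": a data.frame drops to a vector
class(tbl[, 1])          # still a tibble, not a vector
class(tbl[[1]])          # "character"

is.character(tbl[, 1])   # FALSE, which is why the original loop never fires
is.character(tbl[[1]])   # TRUE

# drop = TRUE asks the tibble to return a vector, like a data.frame does by default
is.character(tbl[, 1, drop = TRUE])   # TRUE
```

Using [[ is usually the clearer choice, since it behaves the same way for both data frames and tibbles.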

data

set.seed(24)
loan <- as_tibble(data.frame(
  Col1 = sample(c(NA, letters[1:5]), 20, replace = TRUE),
  Col2 = sample(c(NA, 1:4), 20, replace = TRUE),
  Col3 = sample(c(NA, LETTERS[1:3]), 20, replace = TRUE),
  stringsAsFactors = FALSE
))

Upvotes: 2
