Ahmed Mostafa

Reputation: 15

For loop converting NA in factor variables into "None"

I want to convert the NAs in my factor variables into a string "None" that will be a level in my data set.

I have tried:

for ( col in 1:ncol(data)){
  class(data$col) == "factor"
  data$col = addNA(data$col)
  levels(data$col) <- c(levels(data$col), "None")
  print(summary(data))
}

And I got this error:

Unknown or uninitialised column: `col`.
Unknown or uninitialised column: `col`.
Error: Assigned data `addNA(cdata$col)` must be compatible with existing data.
x Existing data has 1000 rows.
x Assigned data has 0 rows.
i Only vectors of size 1 are recycled.

What is wrong with this approach? Is there a better way to do this for all factor columns at once, rather than handling each column separately?

Upvotes: 1

Views: 74

Answers (2)

TarJae

Reputation: 79184

Here is an alternative way:

  1. Identify which columns are factors
  2. Add "None" to the levels of each factor
  3. Replace the NAs with "None"

Here is an example with a mock dataset:

# identify which is factor column
x <-  sapply(df, is.factor) 

df[, x] <- lapply(df[, x], function(.){
    levels(.) <- c(levels(.), "None")
    replace(., is.na(.), "None")
})

output:

  a     b         c
  <fct> <fct> <dbl>
1 1     None      2
2 None  3        NA
3 4     None     NA

data:

df <- structure(list(a = structure(c(1L, NA, 2L), .Label = c("1", "4"
), class = "factor"), b = structure(c(NA, 1L, NA), .Label = "3", class = "factor"), 
c = c(2, NA, NA)), row.names = c(NA, -3L), class = c("tbl_df", 
"tbl", "data.frame"))

Upvotes: 1

akrun

Reputation: 887651

We can loop across the columns that are factors and convert the NAs to "None" using fct_explicit_na from forcats:

library(dplyr)
library(forcats)
data <- data %>%
     mutate(across(where(is.factor), ~ fct_explicit_na(., na_level = "None")))
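
In forcats 1.0.0 and later, fct_explicit_na() is deprecated in favour of fct_na_value_to_level(). Below is a minimal sketch of the same step with the newer function, assuming forcats >= 1.0.0 and dplyr are installed. The toy data frame is made up for illustration and is not the asker's data.

```r
library(dplyr)
library(forcats)

# Toy data (hypothetical): two factor columns containing NAs, one numeric column
data <- data.frame(
  a = factor(c("1", NA, "4")),
  b = factor(c(NA, "3", NA)),
  c = c(2, NA, NA)
)

# Turn the NA values of every factor column into an explicit "None" level
data <- data %>%
  mutate(across(where(is.factor), ~ fct_na_value_to_level(.x, level = "None")))

levels(data$a)
```

The numeric column c is left untouched because where(is.factor) only selects factor columns.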

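In the for loop, there are multiple issues:

  1. class(data$col) == "factor" is evaluated, but its result is discarded; it should be the condition of an if(...) expression
  2. data$col is wrong because there is no column literally named col; use data[[col]] instead
  3. summary(data) can be called once, outside the for loop
  4. addNA() only turns NA into a level; to get "None", add "None" to the levels and then replace the NA values with it

for (col in seq_along(data)) {
  # is.factor() also handles ordered factors, whose class has length 2
  if (is.factor(data[[col]])) {
    # add "None" as a level first, then recode the NA values to it
    levels(data[[col]]) <- c(levels(data[[col]]), "None")
    data[[col]][is.na(data[[col]])] <- "None"
  }
}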

print(summary(data))
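
To see why data$col failed in the question's loop: `$` takes the name after it literally, so data$col looks for a column called "col" instead of using the value of the loop variable. A small base R sketch, using a made-up data frame rather than the asker's data:

```r
# Hypothetical two-column data frame, for illustration only
data <- data.frame(x = factor(c("a", NA)), y = c(1, 2))

col <- 1

# `$` does not evaluate `col`; it looks up a column literally named "col".
# A plain data.frame returns NULL here; a tibble additionally warns.
by_dollar <- data$col

# `[[` evaluates `col`, so this returns the first column
by_brackets <- data[[col]]

is.null(by_dollar)
is.factor(by_brackets)
```

Assigning addNA(NULL) back into the data frame is what produced the "Assigned data has 0 rows" error in the question.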

Upvotes: 1
