nsivakr

Reputation: 1595

Setting all NA to blank in an R data frame

I've been successful most of the time, but in one instance the following code throws an error:

current[is.na(current)] = ""

Error: character string is not in a standard unambiguous format

The following works, but how do I avoid writing it three times?

isnaColumns <- sapply(current, is.character)
current[,isnaColumns] <- lapply(current[,isnaColumns], function(z) replace(z, is.na(z), ""))

isnaColumns <- sapply(current, is.numeric)
current[,isnaColumns] <- lapply(current[,isnaColumns], function(z) replace(z, is.na(z), "" ))

isnaColumns <- sapply(current, is.logical)
current[,isnaColumns] <- lapply(current[,isnaColumns], function(z) replace(z, is.na(z), "" ))

Upvotes: 0

Views: 156

Answers (1)

r2evans

Reputation: 160817

I think an even better approach is to update only the columns where a blank makes sense, that is, character and possibly factor columns. The character case is as simple as:

ischr <- sapply(current, is.character)
current[,ischr] <- lapply(current[,ischr], function(z) replace(z, is.na(z), ""))

(Apologies for the previous code that was exploding incorrectly ...)
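The factor case is not shown above, so here is a minimal sketch of one way it could be handled. It assumes an empty-string level is acceptable in your factors; `blank_na_factor` and `demo` are hypothetical names used only for illustration. The list-style indexing `demo[isfct]` (no comma) is used so that a single matching column is not dropped to a bare vector.

```r
# Hypothetical helper: replace NA in a factor with an empty-string level.
blank_na_factor <- function(f) {
  # "" has to exist as a level before it can be assigned
  levels(f) <- union(levels(f), "")
  f[is.na(f)] <- ""
  f
}

# Small made-up data frame with one character and one factor column
demo <- data.frame(
  chr = c("a", NA),
  fct = factor(c("x", NA)),
  stringsAsFactors = FALSE
)

isfct <- sapply(demo, is.factor)
# demo[isfct] always returns a data frame, even when only one column matches
demo[isfct] <- lapply(demo[isfct], blank_na_factor)
str(demo)
```

Only the factor column is touched here; the character column keeps its NA until the character step is applied as well.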

Testing with large-ish data:

n <- 1e7 # 10,000,000
set.seed(42) # R-4.0.2
current <- data.frame(
  int=sample(1000, size=n, replace=TRUE),
  chr1=sample(letters, size=n, replace=TRUE),
  chr2=sample(LETTERS, size=n, replace=TRUE),
  chr3=sample(letters, size=n, replace=TRUE),
  chr4=sample(LETTERS, size=n, replace=TRUE),
  chr5=sample(letters, size=n, replace=TRUE),
  chr6=sample(LETTERS, size=n, replace=TRUE)
)
ischr <- sapply(current, is.character)
ischr
#   int  chr1  chr2  chr3  chr4  chr5  chr6 
# FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 
current[,ischr] <- lapply(current[,ischr], function(z) replace(z, sample(n, size=n/10), NA))
head(current)
#   int chr1 chr2 chr3 chr4 chr5 chr6
# 1 561    y    Z    m    D    y    P
# 2 997    l    D    q <NA>    c    C
# 3 321    z    Q    a    E <NA>    H
# 4 153    n    K <NA>    C    h    P
# 5  74 <NA>    I    t    S    y    N
# 6 228    e    C    s    Z    q    L
system.time({
  current[,ischr] <- lapply(current[,ischr], function(z) replace(z, is.na(z), ""))
})
#    user  system elapsed 
#    0.39    0.06    0.45 
head(current)
#   int chr1 chr2 chr3 chr4 chr5 chr6
# 1 561    y    Z    m    D    y    P
# 2 997    l    D    q         c    C
# 3 321    z    Q    a    E         H
# 4 153    n    K         C    h    P
# 5  74         I    t    S    y    N
# 6 228    e    C    s    Z    q    L
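If you really do want the character, numeric, and logical columns from the question handled without repeating the code three times, a single pass over the columns is one option. This is only a sketch: `blank_na` and `demo` are hypothetical names, and note that putting "" into a numeric or logical column converts that whole column to character. Columns of any other class (for example Date) are returned unchanged.

```r
# Hypothetical helper: blank out NA in character, numeric, and logical
# columns in one pass; columns of any other class are left as they are.
blank_na <- function(df) {
  df[] <- lapply(df, function(z) {
    if (is.character(z) || is.numeric(z) || is.logical(z)) {
      # assigning "" coerces numeric and logical columns to character
      replace(z, is.na(z), "")
    } else {
      z
    }
  })
  df
}

# Small made-up data frame covering the three types plus a Date column
demo <- data.frame(
  chr = c("a", NA),
  num = c(1.5, NA),
  lgl = c(TRUE, NA),
  day = as.Date(c("2020-01-01", NA)),
  stringsAsFactors = FALSE
)

out <- blank_na(demo)
str(out)
```

Because of the type conversion, I would still lean towards restricting the replacement to character columns unless the data frame is only being prepared for display or export.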

Upvotes: 2
