Ido Tamir

Reputation: 3137

Automatic casting of data.frame columns

I am looking for a way to automatically convert a data frame consisting of factor columns (like df) to the best possible type for each column, similar to what read.table produces (like df2). One possibility would be to write the data frame to a string and read it back in with read.table. Are there other options?

> df <- data.frame(a=c(" 1"," 2", " 3"),b=c("a","b","c"),c=c(" 1.0", "NA", " 2.0"),d=c(" 1", "B", "2"))
> str(df)
'data.frame':   3 obs. of  4 variables:
 $ a: Factor w/ 3 levels " 1"," 2"," 3": 1 2 3
 $ b: Factor w/ 3 levels "a","b","c": 1 2 3
 $ c: Factor w/ 3 levels " 1.0"," 2.0",..: 1 3 2
 $ d: Factor w/ 3 levels " 1","2","B": 1 3 2
> df2 <- with(df, data.frame(a=as.integer(as.character(a)),b=b,c=as.numeric(as.character(c)),d=as.character(d), stringsAsFactors=FALSE))
> str(df2)
'data.frame':   3 obs. of  4 variables:
 $ a: int  1 2 3
 $ b: Factor w/ 3 levels "a","b","c": 1 2 3
 $ c: num  1 NA 2
 $ d: chr  " 1" "B" "2"
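The round trip through read.table mentioned in the question could look roughly like this. It is only a sketch: the tab separator and the name df_rt are my own choices, and it assumes none of the values contain tabs or newlines.

```r
df <- data.frame(a = c(" 1", " 2", " 3"), b = c("a", "b", "c"),
                 c = c(" 1.0", "NA", " 2.0"), d = c(" 1", "B", "2"),
                 stringsAsFactors = TRUE)

# Write the data frame as tab-separated text into a character vector ...
txt <- capture.output(write.table(df, sep = "\t", quote = FALSE, row.names = FALSE))

# ... and let read.table pick the column types while parsing it again
df_rt <- read.table(text = txt, sep = "\t", header = TRUE, stringsAsFactors = TRUE)
str(df_rt)
```

This works, but it formats and re-parses every value just to get the type detection.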

Upvotes: 2

Views: 425

Answers (1)

A5C1D2H2I1M1N2O1R2T1

Reputation: 193537

Use type.convert, the function that read.table itself uses to decide column types.

Example:

df <- data.frame(a=c(" 1"," 2", " 3"), b=c("a","b","c"),
                 c=c(" 1.0", "NA", " 2.0"), d=c(" 1", "B", "2"),
                 stringsAsFactors=TRUE)  # factor columns, as in the question
str(df)
# 'data.frame':  3 obs. of  4 variables:
#  $ a: Factor w/ 3 levels " 1"," 2"," 3": 1 2 3
#  $ b: Factor w/ 3 levels "a","b","c": 1 2 3
#  $ c: Factor w/ 3 levels " 1.0"," 2.0",..: 1 3 2
#  $ d: Factor w/ 3 levels " 1","2","B": 1 3 2
# as.is=FALSE turns the non-numeric columns into factors, as in the output below
df[] <- lapply(df, function(y) type.convert(as.character(y), as.is=FALSE))
df
#   a b  c  d
# 1 1 a  1  1
# 2 2 b NA  B
# 3 3 c  2  2
str(df)
# 'data.frame':  3 obs. of  4 variables:
#  $ a: int  1 2 3
#  $ b: Factor w/ 3 levels "a","b","c": 1 2 3
#  $ c: num  1 NA 2
#  $ d: Factor w/ 3 levels " 1","2","B": 1 3 2

(But I'm not sure if this is what you're looking for...)
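If you would rather get character columns than factors for anything that is not numeric, type.convert has an as.is argument for that. A minimal sketch (df_chr is just a name I picked for the result):

```r
df <- data.frame(a = c(" 1", " 2", " 3"), b = c("a", "b", "c"),
                 c = c(" 1.0", "NA", " 2.0"), d = c(" 1", "B", "2"),
                 stringsAsFactors = TRUE)

df_chr <- df
# as.is = TRUE: columns that cannot be read as numbers stay character
df_chr[] <- lapply(df, function(y) type.convert(as.character(y), as.is = TRUE))
str(df_chr)
```

Note that this applies to every non-numeric column, so b comes back as character too. It cannot keep b as a factor while making d character, as in your df2.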


Update: If you want something that works like the colClasses argument of read.table, you could try a function like the one below. Unlike what your question title asks for, it is not "automatic", but it lets you specify the class for each column instead of leaving the decision to type.convert.

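toColClasses <- function(inDF, colClasses) {
  if (length(colClasses) != length(inDF)) {
    stop("Please specify colClasses for each column")
  }
  inDF[] <- lapply(seq_along(colClasses), function(y) {
    # An empty string means "leave this column unchanged"
    if (colClasses[y] == "") return(inDF[[y]])
    # Look up the conversion function by name, e.g. "as.integer"
    FUN <- match.fun(colClasses[y])
    # Go through character first so factors convert by label, not by level code
    suppressWarnings(FUN(as.character(inDF[[y]])))
  })
  inDF
}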

You would use it as follows:

df <- data.frame(a = c(" 1"," 2", " 3"), b = c("a","b","c"),
                 c = c(" 1.0", "NA", " 2.0"), d = c(" 1", "B", "2"),
                 stringsAsFactors = TRUE)  # factor columns, as in the question

df2 <- toColClasses(df, c("as.integer", "", "as.numeric", "as.character"))
df2
#   a b  c  d
# 1 1 a  1  1
# 2 2 b NA  B
# 3 3 c  2  2
str(df2)
# 'data.frame':  3 obs. of  4 variables:
#  $ a: int  1 2 3
#  $ b: Factor w/ 3 levels "a","b","c": 1 2 3
#  $ c: num  1 NA 2
#  $ d: chr  " 1" "B" "2"

The function would need more work to handle a wider range of as.* functions, though, for example converters that need extra arguments.
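One way to loosen that restriction, as a rough sketch only: pass the converters as a list of functions instead of a vector of names, with NULL meaning "leave the column alone". The name convertCols and the NULL convention are made up for this example.

```r
convertCols <- function(inDF, converters) {
  stopifnot(length(converters) == length(inDF))
  inDF[] <- Map(function(col, f) {
    if (is.null(f)) return(col)   # NULL: keep the column unchanged
    f(as.character(col))          # convert by label, not by factor level code
  }, inDF, converters)
  inDF
}

df <- data.frame(a = c(" 1", " 2", " 3"), b = c("a", "b", "c"),
                 c = c(" 1.0", "NA", " 2.0"), d = c(" 1", "B", "2"),
                 stringsAsFactors = TRUE)

# Any function of one argument works, including anonymous ones
df3 <- convertCols(df, list(as.integer, NULL, as.numeric, function(x) trimws(x)))
str(df3)
```

Because each converter is an ordinary function, extra arguments can be fixed in an anonymous wrapper. Unlike toColClasses, this version does not suppress coercion warnings.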

Upvotes: 3
