Reputation: 3137
I am interested in a method that automatically converts a data frame consisting of factor columns (like df) to the best possible type, similar to what read.table creates (like df2). One possibility could be to write the data frame into a string and read it back in with read.table. Are there other ones?
> df <- data.frame(a=c(" 1"," 2", " 3"),b=c("a","b","c"),c=c(" 1.0", "NA", " 2.0"),d=c(" 1", "B", "2"))
> str(df)
'data.frame': 3 obs. of 4 variables:
$ a: Factor w/ 3 levels " 1"," 2"," 3": 1 2 3
$ b: Factor w/ 3 levels "a","b","c": 1 2 3
$ c: Factor w/ 3 levels " 1.0"," 2.0",..: 1 3 2
$ d: Factor w/ 3 levels " 1","2","B": 1 3 2
> df2 <- with(df, data.frame(a=as.integer(a),b=b,c=as.numeric(c),d=as.character(d), stringsAsFactors=FALSE))
> str(df2)
'data.frame': 3 obs. of 4 variables:
$ a: int 1 2 3
$ b: Factor w/ 3 levels "a","b","c": 1 2 3
$ c: num 1 3 2
$ d: chr " 1" "B" "2"
Upvotes: 2
Views: 425
Reputation: 193537
Use the function that read.table makes use of: type.convert.
Example:
df <- data.frame(a=c(" 1"," 2", " 3"), b=c("a","b","c"),
c=c(" 1.0", "NA", " 2.0"), d=c(" 1", "B", "2"))
str(df)
# 'data.frame': 3 obs. of 4 variables:
# $ a: Factor w/ 3 levels " 1"," 2"," 3": 1 2 3
# $ b: Factor w/ 3 levels "a","b","c": 1 2 3
# $ c: Factor w/ 3 levels " 1.0"," 2.0",..: 1 3 2
# $ d: Factor w/ 3 levels " 1","2","B": 1 3 2
df[] <- lapply(df, function(y) type.convert(as.character(y)))
df
#   a b  c  d
# 1 1 a  1  1
# 2 2 b NA  B
# 3 3 c  2  2
str(df)
# 'data.frame': 3 obs. of 4 variables:
# $ a: int 1 2 3
# $ b: Factor w/ 3 levels "a","b","c": 1 2 3
# $ c: num 1 NA 2
# $ d: Factor w/ 3 levels " 1","2","B": 1 3 2
(But I'm not sure if this is what you're looking for...)
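If the character columns should stay character rather than become factors (as with column d in df2 from the question), the as.is argument of type.convert may be closer to what is wanted. This is a minimal sketch and not part of the original answer. stringsAsFactors = TRUE is passed explicitly only to reproduce the all-factor input, because data.frame no longer creates factors by default from R 4.0.0 onward.

```r
# Rebuild the all-factor input from the question.
df <- data.frame(a = c(" 1", " 2", " 3"), b = c("a", "b", "c"),
                 c = c(" 1.0", "NA", " 2.0"), d = c(" 1", "B", "2"),
                 stringsAsFactors = TRUE)

# as.is = TRUE asks type.convert to return character vectors
# instead of turning non-numeric columns into factors.
df[] <- lapply(df, function(y) type.convert(as.character(y), as.is = TRUE))

# Inspect the resulting class of each column.
sapply(df, class)
```

With this setting, a should come back as integer, c as numeric with an NA in the second row, and b and d as character. Note that b is then character too, so it would need an explicit as.factor() if a factor is wanted there.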
Update: If you wanted to create a colClasses-type function, perhaps you can try a function like this. Unlike your question title, this is not "automatic", but it does allow you to specify the column class for each column instead of leaving it to type.convert to decide.
toColClasses <- function(inDF, colClasses) {
  if (length(colClasses) != length(inDF)) stop("Please specify colClasses for each column")
  inDF[] <- lapply(seq_along(colClasses), function(y) {
    # An empty string means "leave this column unchanged"
    if (colClasses[y] == "") inDF[[y]]
    else {
      # Look up the conversion function by name, e.g. "as.integer"
      FUN <- match.fun(colClasses[y])
      # Convert via character so factors yield their labels, not their codes
      suppressWarnings(FUN(as.character(inDF[[y]])))
    }
  })
  inDF
}
You would use it as follows:
df <- data.frame(a = c(" 1"," 2", " 3"), b = c("a","b","c"),
c = c(" 1.0", "NA", " 2.0"), d = c(" 1", "B", "2"))
df2 <- toColClasses(df, c("as.integer", "", "as.numeric", "as.character"))
df2
#   a b  c  d
# 1 1 a  1  1
# 2 2 b NA  B
# 3 3 c  2  2
str(df2)
# 'data.frame': 3 obs. of 4 variables:
# $ a: int 1 2 3
# $ b: Factor w/ 3 levels "a","b","c": 1 2 3
# $ c: num 1 NA 2
# $ d: chr " 1" "B" "2"
You would have to do some more work on the function to get it to accept a wider range of as... functions though.
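One possible direction for that extra work, sketched here under assumptions: accept plain class names such as "integer" or "character" (closer to how colClasses reads in read.table) and build the converter name from them. The name toColClasses2 and the convention that NA or "" means "leave the column alone" are hypothetical choices for illustration, not part of the original answer.

```r
# Hypothetical variant of toColClasses(): takes class names ("integer",
# "numeric", "character", "factor", ...) instead of function names.
toColClasses2 <- function(inDF, colClasses) {
  if (length(colClasses) != length(inDF)) stop("Please specify colClasses for each column")
  inDF[] <- lapply(seq_along(colClasses), function(i) {
    cls <- colClasses[i]
    # NA or "" is treated as "keep this column as it is" (an assumption)
    if (is.na(cls) || cls == "") return(inDF[[i]])
    # Build the converter name, e.g. "integer" -> as.integer
    FUN <- match.fun(paste0("as.", cls))
    # Convert via character so factors yield their labels, not their codes
    suppressWarnings(FUN(as.character(inDF[[i]])))
  })
  inDF
}

# stringsAsFactors = TRUE reproduces the all-factor input on R >= 4.0.0
df <- data.frame(a = c(" 1", " 2", " 3"), b = c("a", "b", "c"),
                 c = c(" 1.0", "NA", " 2.0"), d = c(" 1", "B", "2"),
                 stringsAsFactors = TRUE)
df2 <- toColClasses2(df, c("integer", "", "numeric", "character"))
sapply(df2, class)
```

This only works for classes that have a matching as.<class> function taking a character vector; anything else would fail inside match.fun, so a real version would probably want a clearer error message there.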
Upvotes: 3