alex

Reputation: 1145

R: why does read.table turn character strings into numeric by dropping the last character (a period), and how to avoid it?

I have a data frame that I want to export to CSV and then re-import. On import, one column is corrupted: the trailing period is dropped from each string and the values are interpreted as numeric.

Here is a minimal example:

df <- data.frame(integers = c(1:8, NA, 10L),
                 doubles  = as.numeric(paste0(c(1:7, NA, 9, 10), ".1")),
                 strings = paste0(c(1:10),".")
                 )
df
str(df) # here the last column is "chr"

write.table(df,
            file = "df.csv",
            sep = "\t",
            na = "NA",
            row.names = FALSE,
            col.names = TRUE,
            fileEncoding = "UTF-8"
            )

df <- read.table(file = "df.csv",
                 header = TRUE,
                 sep = "\t",
                 na.strings = "NA",
                 quote="\"",
                 fileEncoding = "UTF-8"
                 )
df
str(df)  # here the last column is "num"

Upvotes: 3

Views: 1934

Answers (1)

akrun

Reputation: 887691

With read.table, we can specify colClasses. The valid atomic modes are listed in ?vector:

The atomic modes are "logical", "integer", "numeric" (synonym "double"), "complex", "character" and "raw".

The issue is that when colClasses is not specified, read.table uses type.convert to guess the type of each column. From ?read.table:

Unless colClasses is specified, all columns are read as character columns and then converted using type.convert to logical, integer, numeric, complex or (depending on as.is) factor as appropriate.
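To see why the strings column ends up numeric, it may help to call type.convert directly. This is a small sketch, not part of the original answer; the vectors x and mixed are made up for illustration. A string such as "1." is still a valid numeric literal, so the conversion succeeds as long as every entry parses as a number.

```r
# Hypothetical input with the same shape as the "strings" column
x <- c("1.", "2.", "10.")

# Every entry parses as a number, so type.convert returns a numeric vector
converted <- type.convert(x, as.is = TRUE)
class(converted)

# A single non-numeric entry keeps the whole vector as character
mixed <- type.convert(c("1.", "a."), as.is = TRUE)
class(mixed)
```

So the period is not stripped as text; the value is parsed as a number and then printed without it.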

The relevant code in read.table would be

...
     do[1L] <- FALSE
    for (i in (1L:cols)[do]) {
        data[[i]] <- if (is.na(colClasses[i])) 
            type.convert(data[[i]], as.is = as.is[i], dec = dec, 
                numerals = numerals, na.strings = character(0L))
        else if (colClasses[i] == "factor") 
            as.factor(data[[i]])
        else if (colClasses[i] == "Date") 
            as.Date(data[[i]])
        else if (colClasses[i] == "POSIXct") 
            as.POSIXct(data[[i]])
        else methods::as(data[[i]], colClasses[i])
    }
...

Passing colClasses explicitly skips type.convert for those columns:

df <- read.table(file = "df.csv",
                 header = TRUE,
                 sep = "\t",
                 na.strings = "NA",
                 quote="\"",
                 fileEncoding = "UTF-8",
                 colClasses = c("integer", "numeric", "character")
                 )

Checking the structure:

str(df)
'data.frame':   10 obs. of  3 variables:
 $ integers: int  1 2 3 4 5 6 7 8 NA 10
 $ doubles : num  1.1 2.1 3.1 4.1 5.1 6.1 7.1 NA 9.1 10.1
 $ strings : chr  "1." "2." "3." "4." ...
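If only one column needs protecting, colClasses can also be given as a named vector. Per ?read.table, names are matched against the column names and unspecified columns are treated as NA, that is, they still go through type.convert. A minimal sketch, assuming a temporary file path and a cut-down version of the data frame (the names tmp and df2 are illustrative):

```r
# Cut-down version of the example data, written to a temporary file
df <- data.frame(integers = c(1:8, NA, 10L),
                 strings  = paste0(1:10, "."))
tmp <- tempfile(fileext = ".csv")
write.table(df, file = tmp, sep = "\t", row.names = FALSE)

# Pin only the "strings" column; "integers" is still auto-detected
df2 <- read.table(tmp, header = TRUE, sep = "\t",
                  colClasses = c(strings = "character"))
str(df2)
```

This avoids listing a class for every column, which can be handy for wide files where only a few columns are at risk.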

Upvotes: 3
