robintw

Reputation: 28531

Stop data being read as factors by default with read.zoo

I am using the zoo package in R to analyse time series of data. I have the following data file:

Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1640,AOT_1020,AOT_870,AOT_675,AOT_667,AOT_555,AOT_551,AOT_532,AOT_531,AOT_500,AOT_490,AOT_443,AOT_440,AOT_412,AOT_380,AOT_340,Water(cm),%TripletVar_1640,%TripletVar_1020,%TripletVar_870,%TripletVar_675,%TripletVar_667,%TripletVar_555,%TripletVar_551,%TripletVar_532,%TripletVar_531,%TripletVar_500,%TripletVar_490,%TripletVar_443,%TripletVar_440,%TripletVar_412,%TripletVar_380,%TripletVar_340,%WaterError,440-870Angstrom,380-500Angstrom,440-675Angstrom,500-870Angstrom,340-440Angstrom,440-675Angstrom(Polar),Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
29:03:2011,09:26:28,88.393380,N/A,0.490230,0.553836,0.707512,N/A,N/A,N/A,N/A,N/A,0.911939,N/A,N/A,0.984430,N/A,1.046517,1.081283,1.632430,N/A,4.597345,4.551429,3.216097,N/A,N/A,N/A,N/A,N/A,2.587552,N/A,N/A,2.694179,N/A,2.085042,2.522511,2.309844,0.851964,0.497006,0.789257,0.898093,0.362423,N/A,13/04/2011,58.822462
29:03:2011,09:41:28,88.403796,N/A,0.440362,0.513093,0.676703,N/A,N/A,N/A,N/A,N/A,0.893867,N/A,N/A,0.965588,N/A,1.034943,1.079975,1.654521,N/A,12.867837,12.687550,11.037238,N/A,N/A,N/A,N/A,N/A,9.345739,N/A,N/A,8.423888,N/A,8.421787,9.334135,1.622026,0.937815,0.529939,0.852553,0.999260,0.431102,N/A,13/04/2011,57.070624
29:03:2011,10:11:29,88.424641,N/A,0.565148,0.654724,0.842142,N/A,N/A,N/A,N/A,N/A,1.070556,N/A,N/A,1.144966,N/A,1.208759,1.242663,1.666760,N/A,9.933505,9.499251,8.327355,N/A,N/A,N/A,N/A,N/A,6.781617,N/A,N/A,6.612952,N/A,5.600500,5.630695,1.302058,0.826713,0.438445,0.736362,0.884554,0.316539,N/A,13/04/2011,53.916620
29:03:2011,10:17:46,88.429005,N/A,0.593881,0.681572,0.866620,N/A,N/A,N/A,N/A,N/A,1.095508,N/A,N/A,1.168008,N/A,1.233022,1.268572,1.704882,N/A,4.072782,3.752197,3.210935,N/A,N/A,N/A,N/A,N/A,2.389567,N/A,N/A,2.385582,N/A,1.653326,1.015620,0.728711,0.798185,0.427272,0.716165,0.853963,0.319100,N/A,13/04/2011,53.323057
29:03:2011,10:26:27,88.435035,N/A,0.636627,0.714175,0.884887,N/A,N/A,N/A,N/A,N/A,1.092220,N/A,N/A,1.167024,N/A,1.224264,1.271774,1.626393,N/A,16.400200,10.585139,6.513873,N/A,N/A,N/A,N/A,N/A,3.169704,N/A,N/A,4.085949,N/A,3.963741,8.663229,10.035231,0.724581,0.411533,0.659996,0.764539,0.329073,N/A,13/04/2011,52.544475

I am trying to read it using the following code:

f <- function(d, t) as.chron(paste(as.Date(chron(d, format='d:m:y')), t))

z = read.zoo("110329_110329_Chilbolton.lev10", sep=',', header=T, index = 1:2, FUN=f, as.is=F, dec=".")

But all of the columns of the dataset are being read as factors - so, when I do summary(z) I get output like:

X.TripletVar_340    X.WaterError X440.870Angstrom X380.500Angstrom X440.675Angstrom X500.870Angstrom
 1.015620:1        0.728711:1     0.724581:1       0.411533:1       0.659996:1       0.764539:1      
 2.522511:1        1.302058:1     0.798185:1       0.427272:1       0.716165:1       0.853963:1      
 5.630695:1        1.622026:1     0.826713:1       0.438445:1       0.736362:1       0.884554:1      
 8.663229:1        2.309844:1     0.851964:1       0.497006:1       0.789257:1       0.898093:1      
 9.334135:1       10.035231:1     0.937815:1       0.529939:1       0.852553:1       0.999260:1      

How can I stop it reading the data as factors by default? The data is read fine by read.table without any extra parameters to tell it to make sure everything stays as numbers not factors - so why is read.zoo behaving differently?

I suppose I could use colClasses to specify the type of each column, but I'd rather not do this in case the order of the columns in the dataset is changed - getting it to convert to numbers by default, and then try factors if that doesn't work would be far better.

Any ideas?

Upvotes: 2

Views: 2597

Answers (4)

G. Grothendieck

Reputation: 269664

This has been diagnosed already but let us add this so that we have an example of a read.zoo statement that could be used here.

There are two problems: (1) the NAs are represented as N/A rather than NA, so we must tell read.zoo that; (2) the second-to-last column is not numeric. zoo represents the data as a matrix, so it must all be numeric (factor zoo objects are supported too, but numeric and factor columns can't be mixed).
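To see both problems in isolation, here is a minimal sketch in base R only. The three-column data is a made-up stand-in for the real file, and stringsAsFactors = TRUE is set explicitly so the result does not depend on the R version's default.

```r
# Toy data (not the original file): "N/A" marks missing values and
# the last column holds dates as text.
txt <- "a,b,processed
1.5,N/A,13/04/2011
2.5,0.7,13/04/2011"

# Default read: "N/A" is not recognised as missing, so column b
# cannot be converted to numeric and becomes a factor.
raw <- read.csv(text = txt, stringsAsFactors = TRUE)
classes_raw <- sapply(raw, class)

# With na.strings, b becomes numeric; only the date column is left
# as a non-numeric column.
fixed <- read.csv(text = txt, na.strings = "N/A", stringsAsFactors = TRUE)
classes_fixed <- sapply(fixed, class)

print(classes_raw)
print(classes_fixed)
```

The remaining non-numeric column is what still has to be dropped or handled separately before everything fits in one numeric matrix.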

Try this (where we have added a second data line to the example for good measure). Be sure to use the most recent version of zoo to run the example data since the text= argument (which specifies the text of the data itself rather than the filename) was only added recently. Also note that from within R ?read.zoo gives help and vignette("zoo-read") gives a document entirely devoted to read.zoo examples.

Lines <- "Date(dd-mm-yy),Time(hh:mm:ss),Julian_Day,AOT_1640,AOT_1020,AOT_870,AOT_675,AOT_667,AOT_555,AOT_551,AOT_532,AOT_531,AOT_500,AOT_490,AOT_443,AOT_440,AOT_412,AOT_380,AOT_340,Water(cm),%TripletVar_1640,%TripletVar_1020,%TripletVar_870,%TripletVar_675,%TripletVar_667,%TripletVar_555,%TripletVar_551,%TripletVar_532,%TripletVar_531,%TripletVar_500,%TripletVar_490,%TripletVar_443,%TripletVar_440,%TripletVar_412,%TripletVar_380,%TripletVar_340,%WaterError,440-870Angstrom,380-500Angstrom,440-675Angstrom,500-870Angstrom,340-440Angstrom,440-675Angstrom(Polar),Last_Processing_Date(dd/mm/yyyy),Solar_Zenith_Angle
29:03:2011,09:26:28,88.393380,N/A,0.490230,0.553836,0.707512,N/A,N/A,N/A,N/A,N/A,0.911939,N/A,N/A,0.984430,N/A,1.046517,1.081283,1.632430,N/A,4.597345,4.551429,3.216097,N/A,N/A,N/A,N/A,N/A,2.587552,N/A,N/A,2.694179,N/A,2.085042,2.522511,2.309844,0.851964,0.497006,0.789257,0.898093,0.362423,N/A,13/04/2011,58.822462
29:03:2012,09:26:28,88.393380,N/A,0.490230,0.553836,0.707512,N/A,N/A,N/A,N/A,N/A,0.911939,N/A,N/A,0.984430,N/A,1.046517,1.081283,1.632430,N/A,4.597345,4.551429,3.216097,N/A,N/A,N/A,N/A,N/A,2.587552,N/A,N/A,2.694179,N/A,2.085042,2.522511,2.309844,0.851964,0.497006,0.789257,0.898093,0.362423,N/A,13/04/2011,58.822462"

library(chron)
library(zoo)
colClasses <- c("character", "character", rep("numeric", 43))
colClasses[44] <- "NULL" # zap the non-numeric column
z <- read.zoo(text = Lines, header = TRUE, sep = ",", na.strings = "N/A",
    index = 1:2, colClasses = colClasses, FUN = function(d, t)
        as.chron(paste(d, t), "%d:%m:%Y %H:%M:%S"))
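If hard-coding column positions is a concern (the question mentions the column order might change), one possible approach is to build colClasses from the header names instead. This is only a sketch on made-up data using base R; the shortened column names and the "^Last_Processing_Date" pattern are assumptions for illustration.

```r
# Toy data (not the original file) with a shortened header.
txt <- "Date,Time,AOT_1020,Last_Processing_Date,Solar_Zenith_Angle
29:03:2011,09:26:28,0.490230,13/04/2011,58.822462
29:03:2011,09:41:28,N/A,13/04/2011,57.070624"

# Read only the header line; for a real file use scan(file = ..., ...).
hdr <- scan(text = txt, what = "", sep = ",", nlines = 1, quiet = TRUE)

# NA lets read.table guess the type, "NULL" drops the column.
cc <- rep(NA_character_, length(hdr))
cc[1:2] <- "character"                             # date and time index columns
cc[grepl("^Last_Processing_Date", hdr)] <- "NULL"  # drop the text date column

dat <- read.csv(text = txt, na.strings = "N/A", colClasses = cc)
print(sapply(dat, class))
```

Since read.zoo forwards extra arguments to read.table, a vector built this way could presumably be passed as colClasses in the read.zoo call above in place of the positional one.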

Upvotes: 2

IRTFM

Reputation: 263352

The problem appears to be that you are importing from an Excel file and not taking the time to make the "N/A" values into proper NA values. That results in the columns being considered non-numeric. The zoo package needs the coredata to be a matrix, and that severely constrains the options available for processing. Everything needs to be numeric. Even if you put in stringsAsFactors = FALSE you would still get character columns where you expected numeric.

If you read in with read.table and set as.is=TRUE, you can overcome the factor problem. You then need to coerce the columns that you want to be numeric and drop the trailing date column, which will come in with the name "Last_Processing_Date.dd.mm.yyyy."

I would do this first:

z = read.table(file.choose(), sep=',', header=T,  as.is=TRUE, dec=".")

And then choose the columns to coerce to numeric:

z[ , 3:43] <- sapply(z[ , 3:43], as.numeric)

This keeps the date column intact as the 44th column. Then decide which columns should go into the zoo object.
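A small sketch of what that coercion does, on made-up data rather than the real file. The column names here are hypothetical, and the columns are selected by name instead of by position.

```r
# Toy data (not the original file): "N/A" marks a missing value.
txt <- "a,b,processed
1.5,N/A,13/04/2011
2.5,0.7,13/04/2011"

# as.is = TRUE keeps text columns as character instead of factor.
z <- read.csv(text = txt, as.is = TRUE)
classes_before <- sapply(z, class)   # b is character because of "N/A"

# Coerce the chosen columns; "N/A" turns into NA. as.numeric() warns
# about that, so the warning is suppressed here on purpose.
num_cols <- c("a", "b")
z[num_cols] <- lapply(z[num_cols], function(col) suppressWarnings(as.numeric(col)))

print(classes_before)
print(sapply(z, class))
```

Using lapply() with a vector of column names keeps the result a data frame and avoids depending on the column order.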

Edit: I see Gabor Grothendieck has addressed these problems as well which is as it should be since he is one of the authors of the package.

Upvotes: 2

Josh O'Brien

Reputation: 162341

Your data file poses two problems for read.zoo.

First, it uses N/A to denote missing values, rather than the string NA, which read.table() expects by default. This can be fixed by setting na.strings="N/A".

The second problem is that the data file's next-to-last column, Last_Processing_Date.dd.mm.yyyy, contains character strings.

But, according to the zoo FAQ document (warning, PDF):

A "zoo" object may be (1) a numeric vector, (2) a numeric matrix or (3) a factor but may not contain both a numeric vector and factor.

When 'asked' to read in a bunch of columns that contain both numeric and character values, converting everything to factors is the only way that read.zoo() can produce an object fitting one of those three criteria.
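The underlying constraint is easy to see in base R, since a matrix can hold only one type. The sketch below uses a made-up two-column data frame and illustrates the matrix restriction only, not read.zoo()'s internals.

```r
# Toy data frame: one numeric column, one text date column.
df <- data.frame(x = c(0.85, 0.94),
                 d = c("13/04/2011", "13/04/2011"))

# Mixed columns cannot share a numeric matrix: everything becomes character.
m_mixed <- as.matrix(df)

# With the text column removed, the matrix stays numeric.
m_num <- as.matrix(df["x"])

print(typeof(m_mixed))
print(typeof(m_num))
```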

If you remove the offending column, and specify your missing value string, everything works without a hitch. If you do need both numeric and factor columns, the FAQ linked above suggests several possible approaches.

library(chron)
library(zoo)
# f() is the date/time conversion function defined in the question
z <- read.table("110329_110329_Chilbolton.lev10", sep=",", header=TRUE,
                stringsAsFactors=FALSE, na.strings="N/A")
z$Last_Processing_Date.dd.mm.yyyy. <- NULL
z <- zoo(x=z[,-1:-2], order.by=f(z[[1]], z[[2]]))
summary(z)

     Index                       Julian_Day       AOT_1640      AOT_1020     
 Min.   :(03/29/11 09:26:28)   Min.   :88.39   Min.   : NA   Min.   :0.4404  
 1st Qu.:(03/29/11 09:41:28)   1st Qu.:88.40   1st Qu.: NA   1st Qu.:0.4902  
 Median :(03/29/11 10:11:29)   Median :88.42   Median : NA   Median :0.5651  
 Mean   :(03/29/11 10:00:44)   Mean   :88.42   Mean   :NaN   Mean   :0.5452  
 3rd Qu.:(03/29/11 10:17:46)   3rd Qu.:88.43   3rd Qu.: NA   3rd Qu.:0.5939  
 Max.   :(03/29/11 10:26:27)   Max.   :88.44   Max.   : NA   Max.   :0.6366  

Upvotes: 2

John Colby

Reputation: 22588

The ... in read.zoo() will let you pass a stringsAsFactors = F on to read.table(). That should do the trick.

Upvotes: 1
