read.csv in R reading dates differently

Question

I have two very similar csv files. Stock prices for 2 different stocks downloaded from the same source in the same format. However, read.csv in R is reading them differently.

> tab1=read.csv(path1)
> tab2=read.csv(path2)

> head(tab1)
        Date   Open   High    Low  Close  Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461

> head(tab2)
       Date  Open  High   Low Close  Volume Adj.Close
1 12/1/2014 73.39 75.20 71.75 72.29 1561400  71.92211
2 11/3/2014 69.28 74.92 67.88 73.74 1421600  72.97650
3 10/1/2014 66.18 74.95 63.42 69.21 1775400  68.49341
4  9/2/2014 68.34 68.57 65.49 66.32 1249200  65.63333
5  8/1/2014 67.45 68.99 65.88 68.26 1655400  67.20743
6  7/1/2014 64.07 69.50 63.09 67.46 1733600  66.41976

If I try to use colClasses in read.csv then the dates for the second table are read incorrectly.

> tab1=read.csv(path1,colClasses=c("Date",rep("numeric",6)))
> tab2=read.csv(path2,colClasses=c("Date",rep("numeric",6)))

> head(tab1)
        Date   Open   High    Low  Close  Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461

> head(tab2)
        Date  Open  High   Low Close  Volume Adj.Close
1 0012-01-20 73.39 75.20 71.75 72.29 1561400  71.92211
2 0011-03-20 69.28 74.92 67.88 73.74 1421600  72.97650
3 0010-01-20 66.18 74.95 63.42 69.21 1775400  68.49341
4 0009-02-20 68.34 68.57 65.49 66.32 1249200  65.63333
5 0008-01-20 67.45 68.99 65.88 68.26 1655400  67.20743
6 0007-01-20 64.07 69.50 63.09 67.46 1733600  66.41976

Not sure how I can make this issue reproducible without attaching the .csv files. I'm attaching snapshots of the two files. Any help will be appreciated. tab1 tab2

Thanks

bgoldst · Accepted Answer

This can be solved by reading in the dates as a character vector and then calling strptime() inside transform():

transform(read.csv(path2,colClasses=c('character',rep('numeric',6))),Date=as.Date(strptime(Date,'%m/%d/%Y')));
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

Edit: You can try to "detect" the date format dynamically using your own assumptions, but this will only be as reliable as your assumptions:

readStockData <- function(path) {
    tab <- read.csv(path,colClasses=c('character',rep('numeric',6)));
    tab$Date <- as.Date(tab$Date,if (grepl('^\d+/\d+/\d+$',tab$Date[1])) '%m/%d/%Y' else '%Y-%m-%d');
    tab;
};
readStockData(path1);
##         Date   Open   High    Low  Close  Volume Adj.Close
## 1 2014-12-01 158.35 162.92 157.12 157.12 2719100  156.1488
## 2 2014-11-03 153.14 160.86 152.98 160.09 2243400  159.1004
## 3 2014-10-01 141.16 154.44 130.60 153.77 3825900  152.0036
## 4 2014-09-02 143.30 147.87 140.66 141.68 2592900  140.0525
## 5 2014-08-01 140.15 145.39 138.43 144.00 2027100  142.3459
## 6 2014-07-01 143.41 146.43 140.60 140.89 2131100  138.4461
readStockData(path2);
##         Date  Open  High   Low Close  Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400  71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600  72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400  68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200  65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400  67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600  66.41976

In the above I've made the assumption that there is at least one record in the file and that all records use the same Date format, thus the first Date value (tab$Date[1]) can be used for the detection.

read.csv in R reading dates differently

Answers (1)

Related Questions