Reputation: 103
I have two very similar csv files. Stock prices for 2 different stocks downloaded from the same source in the same format. However, read.csv in R is reading them differently.
> tab1=read.csv(path1)
> tab2=read.csv(path2)
> head(tab1)
Date Open High Low Close Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
> head(tab2)
Date Open High Low Close Volume Adj.Close
1 12/1/2014 73.39 75.20 71.75 72.29 1561400 71.92211
2 11/3/2014 69.28 74.92 67.88 73.74 1421600 72.97650
3 10/1/2014 66.18 74.95 63.42 69.21 1775400 68.49341
4 9/2/2014 68.34 68.57 65.49 66.32 1249200 65.63333
5 8/1/2014 67.45 68.99 65.88 68.26 1655400 67.20743
6 7/1/2014 64.07 69.50 63.09 67.46 1733600 66.41976
If I try to use colClasses in read.csv then the dates for the second table are read incorrectly.
> tab1=read.csv(path1,colClasses=c("Date",rep("numeric",6)))
> tab2=read.csv(path2,colClasses=c("Date",rep("numeric",6)))
> head(tab1)
Date Open High Low Close Volume Adj.Close
1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
> head(tab2)
Date Open High Low Close Volume Adj.Close
1 0012-01-20 73.39 75.20 71.75 72.29 1561400 71.92211
2 0011-03-20 69.28 74.92 67.88 73.74 1421600 72.97650
3 0010-01-20 66.18 74.95 63.42 69.21 1775400 68.49341
4 0009-02-20 68.34 68.57 65.49 66.32 1249200 65.63333
5 0008-01-20 67.45 68.99 65.88 68.26 1655400 67.20743
6 0007-01-20 64.07 69.50 63.09 67.46 1733600 66.41976
Not sure how I can make this issue reproducible without attaching the .csv files. I'm attaching snapshots of the two files. Any help will be appreciated.
Thanks
Upvotes: 3
Views: 5529
Reputation: 35314
This can be solved by reading in the dates as a character vector and then calling strptime()
inside transform()
:
transform(read.csv(path2,colClasses=c('character',rep('numeric',6))),Date=as.Date(strptime(Date,'%m/%d/%Y')));
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400 71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600 72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400 68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200 65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400 67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600 66.41976
Edit: You can try to "detect" the date format dynamically using your own assumptions, but this will only be as reliable as your assumptions:
readStockData <- function(path) {
tab <- read.csv(path,colClasses=c('character',rep('numeric',6)));
tab$Date <- as.Date(tab$Date,if (grepl('^\\d+/\\d+/\\d+$',tab$Date[1])) '%m/%d/%Y' else '%Y-%m-%d');
tab;
};
readStockData(path1);
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 158.35 162.92 157.12 157.12 2719100 156.1488
## 2 2014-11-03 153.14 160.86 152.98 160.09 2243400 159.1004
## 3 2014-10-01 141.16 154.44 130.60 153.77 3825900 152.0036
## 4 2014-09-02 143.30 147.87 140.66 141.68 2592900 140.0525
## 5 2014-08-01 140.15 145.39 138.43 144.00 2027100 142.3459
## 6 2014-07-01 143.41 146.43 140.60 140.89 2131100 138.4461
readStockData(path2);
## Date Open High Low Close Volume Adj.Close
## 1 2014-12-01 73.39 75.20 71.75 72.29 1561400 71.92211
## 2 2014-11-03 69.28 74.92 67.88 73.74 1421600 72.97650
## 3 2014-10-01 66.18 74.95 63.42 69.21 1775400 68.49341
## 4 2014-09-02 68.34 68.57 65.49 66.32 1249200 65.63333
## 5 2014-08-01 67.45 68.99 65.88 68.26 1655400 67.20743
## 6 2014-07-01 64.07 69.50 63.09 67.46 1733600 66.41976
In the above I've made the assumption that there is at least one record in the file and that all records use the same Date format, thus the first Date value (tab$Date[1]
) can be used for the detection.
Upvotes: 3