bandcar
bandcar

Reputation: 723

change date to yyyy-mm in R

I have a data set that will be used for time series. the date column is currently structured as follows:

> head(cam_shiller)
    div stock  dates
1 0.495  7.09 1933m1
2 0.490  6.25 1933m2
3 0.485  6.23 1933m3
4 0.480  6.89 1933m4
5 0.475  8.87 1933m5
6 0.470 10.39 1933m6

If I'm not mistaken, monthly data for time series should look like this: yyyy-mm. So I'm trying to make my date column look like this:

    div stock  dates
1 0.495  7.09 1933-01
2 0.490  6.25 1933-02
3 0.485  6.23 1933-03
4 0.480  6.89 1933-04
5 0.475  8.87 1933-05
6 0.470 10.39 1933-06

However, using the as.yearmo function produces a column full of NAs. I tried removing the 'm' and replacing it with a dash, and then running as.yearmo again. Now the results look like this:

    div stock    dates
1 0.495  7.09 Jan 1933
2 0.490  6.25 Feb 1933
3 0.485  6.23 Mar 1933
4 0.480  6.89 Apr 1933
5 0.475  8.87 May 1933
6 0.470 10.39 Jun 1933

How do I change the dates into the yyyy-mm format?

library(zoo)

cam_shiller = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/cam_shiller.csv')

cam_shiller$dates = gsub('m', '-', cam_shiller$dates)
cam_shiller$dates = as.yearmon(cam_shiller$dates)

Upvotes: 1

Views: 145

Answers (3)

G. Grothendieck
G. Grothendieck

Reputation: 269371

The form in the question is already correct. It is not true that you need to change it. It renders as Jan 1933, etc. but internally it is represented as year+(month-1)/12 (where month is a number 1, 2, ..., 12) which is exactly what you need for analysis. You do not want a character string of the form yyyy-mm for analysis.

If by "time series" you mean a zoo series then using u defined in the Note at the end, z below gives that with a yearmon index. The index argument to read.csv.zoo gives the column number or name of the index, the FUN argument tells it how to convert it and the format argument tells it the precise form of the dates.

If what you mean by time series is that you want a ts series then tt below gives that.

If what you mean is a data frame with a yearmon column then DF below gives that.

With either a zoo series or a ts series one could perform a variety of analyses. For example, acf(z) or acf(tt) would give the autocorrelation function.

For more information see ?read.csv.zoo . There is also an entire vignette on read.zoo and its variants. The vignettes are linked to on the CRAN home page for zoo. Also see ?strptime for the percent codes.

library(zoo)

# zoo series with yearmon column
z <- read.csv.zoo(u, index = 3, FUN = as.yearmon, format = "%Ym%m")

# ts series
tt <- as.ts(z)

# data frame with yearmon column
DF <- u |>
  read.csv() |>
  transform(dates = as.yearmon(dates, "%Ym%m"))

A character string of the form yyyy-mm is not a suitable form for most analyses but if you really did want that anyways then

 # zoo series with yyyy-mm character string index
 z2 <- aggregate(z, format(index(z), "%Y-%m"), c)

 # data.frame with yyyy-mm character string column
 DF2 <- transform(DF, dates = format(dates, "%Y-%m"))

Note

u <- "https://raw.githubusercontent.com/bandcar/Examples/main/cam_shiller.csv"

Upvotes: 0

jay.sf
jay.sf

Reputation: 72593

Actually, in ts you just need to specify start= and frequency.

res <- ts(cam_shiller[, -3], start=1933, frequency=12)
res
#             div stock
# Jan 1933 0.4950  7.09
# Feb 1933 0.4900  6.25
# Mar 1933 0.4850  6.23
# Apr 1933 0.4800  6.89
# May 1933 0.4750  8.87
# Jun 1933 0.4700 10.39
# Jul 1933 0.4650 11.23
# Aug 1933 0.4600 10.67
# Sep 1933 0.4550 10.58
# Oct 1933 0.4500  9.55
# Nov 1933 0.4450  9.78
# Dec 1933 0.4400  9.97
# Jan 1934 0.4408 10.54
# Feb 1934 0.4417 11.32
# Mar 1934 0.4425 10.74
# Apr 1934 0.4433 10.92
# May 1934 0.4442  9.81
# Jun 1934 0.4450  9.94
# Jul 1934 0.4458  9.47
# Aug 1934 0.4467  9.10
# Sep 1934 0.4475  8.88
# Oct 1934 0.4483  8.95
# Nov 1934 0.4492  9.20
# Dec 1934 0.4500  9.26
# ...

Or

ts(cam_shiller$stock, start=c(1933, 1), frequency=12)
#  Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct    Nov    Dec
# 1933   7.09   6.25   6.23   6.89   8.87  10.39  11.23  10.67  10.58   9.55   9.78   9.97
# 1934  10.54  11.32  10.74  10.92   9.81   9.94   9.47   9.10   8.88   8.95   9.20   9.26
# 1935   9.26   8.98   8.41   9.04   9.75  10.12  10.65  11.37  11.61  11.92  13.04  13.04
# ...

It may be wise to check beforehand that there are no gaps in the data by evaluating the column and row variances of years and month matrices:

test <- do.call(rbind, strsplit(cam_shiller$dates, 'm')) |>
  type.convert(as.is=TRUE) 
matrixStats::colVars(matrix(test[, 1], 12))
#  [1] 0 0 ...
matrixStats::rowVars(matrix(test[, 2], 12))
# [1] 0 0 0 0 0 0 0 0 0 0 0 0

If you use the xts::xts, it's rather picky since it wants a time-based class such as "Date" or "POSIXct". So you need whole dates, i.e. paste a 01 as pseudo date.

res <- transform(cam_shiller, dates=strptime(paste(dates, '01'), format='%Ym%m %d')) |>
  {\(.) xts::as.xts(.[1:2], .$dates)}()
head(res)
#              div stock
# 1933-01-01 0.495  7.09
# 1933-02-01 0.490  6.25
# 1933-03-01 0.485  6.23
# 1933-04-01 0.480  6.89
# 1933-05-01 0.475  8.87
# 1933-06-01 0.470 10.39

class(res)
# [1] "xts" "zoo"

Data:

cam_shiller <- structure(list(div = c(0.495, 0.49, 0.485, 0.48, 0.475, 0.47, 
0.465, 0.46, 0.455, 0.45, 0.445, 0.44, 0.4408, 0.4417, 0.4425, 
0.4433, 0.4442, 0.445, 0.4458, 0.4467, 0.4475, 0.4483, 0.4492, 
0.45), stock = c(7.09, 6.25, 6.23, 6.89, 8.87, 10.39, 11.23, 
10.67, 10.58, 9.55, 9.78, 9.97, 10.54, 11.32, 10.74, 10.92, 9.81, 
9.94, 9.47, 9.1, 8.88, 8.95, 9.2, 9.26), dates = c("1933m1", 
"1933m2", "1933m3", "1933m4", "1933m5", "1933m6", "1933m7", "1933m8", 
"1933m9", "1933m10", "1933m11", "1933m12", "1934m1", "1934m2", 
"1934m3", "1934m4", "1934m5", "1934m6", "1934m7", "1934m8", "1934m9", 
"1934m10", "1934m11", "1934m12")), row.names = c(NA, 24L), class = "data.frame")

Upvotes: 2

Shafee
Shafee

Reputation: 19857

Try lubridate::ym to change dates to yyyy-mm format

library(tidyverse)

cam_shiller = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/cam_shiller.csv')

cam_shiller %>% 
  mutate(
    date = lubridate::ym(dates),
    date = strftime(date, "%Y-%m")
  ) %>% 
  head()

#>     div stock  dates    date
#> 1 0.495  7.09 1933m1 1933-01
#> 2 0.490  6.25 1933m2 1933-02
#> 3 0.485  6.23 1933m3 1933-03
#> 4 0.480  6.89 1933m4 1933-04
#> 5 0.475  8.87 1933m5 1933-05
#> 6 0.470 10.39 1933m6 1933-06

Created on 2022-10-01 with reprex v2.0.2

Upvotes: 1

Related Questions