Reputation: 23
I have a question regarding creating a new specific variable in a large dataset (36000 rows). I want to create a variable that is cumulative counting the days. So for example:
Date (dd/mm/yyyy)
01-01-2018
01-01-2018
02-01-2018
02-01-2018
02-01-2018
02-01-2018
03-01-2018
03-01-2018
I want to create the following numeric variable which counts the unique days in order, so:
New Variable
1
1
2
2
2
2
3
3
So, the first unique date gets a 1, the second unique date a 2, etc. In total I have more than 36000 observations, where there are 1096 unique dates (three years). So I want a variable that is having a 1 for the first date, a two for the second date, and a 1096 for the last date. However, the dates do not appear equally: one unique date appears maybe 30 times, the other unique date 50 times.
How can I create a variable like this in R?
Upvotes: 2
Views: 81
Reputation: 160852
Two methods:
if you are certain that they are ordered correctly, then you can use:
cumsum(c(TRUE, diff(x$Date) != 0))
# [1] 1 2 2 3 4 4 5 5
if you are not certain or just want to guard against it, you can borrow from how factor
s are stored:
# randomize the data for this example
set.seed(2)
x <- x[sample(seq_len(nrow(x))),,drop=FALSE]
x
# Date
# 2 2018-01-01
# 5 2018-01-02
# 4 2018-01-02
# 1 2018-01-01
# 6 2018-01-02
# 3 2018-01-02
# 7 2018-01-03
# 8 2018-01-03
# this is the real work, works even if ordered
x$NewVar <- as.integer(factor(x$Date))
x
# Date NewVar
# 2 2018-01-01 1
# 5 2018-01-02 2
# 4 2018-01-02 2
# 1 2018-01-01 1
# 6 2018-01-02 2
# 3 2018-01-02 2
# 7 2018-01-03 3
# 8 2018-01-03 3
# reorder for presentation here:
x[order(x$NewVar),]
# Date NewVar
# 2 2018-01-01 1
# 1 2018-01-01 1
# 5 2018-01-02 2
# 4 2018-01-02 2
# 6 2018-01-02 2
# 3 2018-01-02 2
# 7 2018-01-03 3
# 8 2018-01-03 3
Data:
x <- read.table(stringsAsFactors=FALSE, header=TRUE, text="
Date
01-01-2018
01-01-2018
02-01-2018
02-01-2018
02-01-2018
02-01-2018
03-01-2018
03-01-2018")
x$Date <- as.Date(x$Date, format = "%d-%m-%Y")
Upvotes: 2