Reputation: 342
I am dealing with a 2M rows DB and my if loop is taking too long.
The data base has 3 variables : Gift_ID
, ind_id
and gift_date
with data and two empty variables which I want to fill in min
and max
. For every different id (ind_id
) I want to identify the first gift date and the last gift date. I've written these two loops that worked (the loops work if the DB is ordered by ind_ID
and then gift_date
), but when I run them with the hole DB it takes too long. Any ideas of another way of writing this?
for (i in 2:length(NewGifts[,1])){
if(NewGifts$ind_id[i] != NewGifts$ind_id[i-1]){
NewGifts$min[i] = format.Date(NewGifts$gift_date[i], '%Y%m%d')
} else {
NewGifts$min[i] = NewGifts$min[i-1]
}
}
for (i in ((length(NewGifts[,1])-1):1)){
if(NewGifts$ind_id[i] != NewGifts$ind_id[i+1]){
NewGifts$max[i] = format.Date(NewGifts$gift_date[i], '%Y%m%d')
} else {
NewGifts$max[i] = NewGifts$max[i+1]
}
}
Maybe working with data.table is more efficient, but I couldn't find any post helping.
Sample of the data:
ind_id gift_id gift_date
1 1 20190101
1 2 20190102
1 3 20190103
1 4 20190104
2 5 20190201
2 6 20190319
2 7 20190412
2 8 20190511
2 9 20190601
Expected result :
ind_id gift_id gift_date min max
1 1 20190101 20190101 20190104
1 2 20190102 20190101 20190104
1 3 20190103 20190101 20190104
1 4 20190104 20190101 20190104
2 5 20190201 20190201 20190601
2 6 20190319 20190201 20190601
2 7 20190412 20190201 20190601
2 8 20190511 20190201 20190601
2 9 20190601 20190201 20190601
Upvotes: 2
Views: 137
Reputation: 887881
We can use also do this in tidyverse
with
library(tidyr)
library(dplyr)
library(lubridate)
df %>%
group_by(ind_id) %>%
mutate(minmax = list(set_names(range(ymd(gift_date)), c('min', 'max')))) %>%
unnest_wider(minmax)
# A tibble: 9 x 5
# Groups: ind_id [2]
# ind_id gift_id gift_date min max
# <int> <int> <int> <date> <date>
#1 1 1 20190101 2019-01-01 2019-01-04
#2 1 2 20190102 2019-01-01 2019-01-04
#3 1 3 20190103 2019-01-01 2019-01-04
#4 1 4 20190104 2019-01-01 2019-01-04
#5 2 5 20190201 2019-02-01 2019-06-01
#6 2 6 20190319 2019-02-01 2019-06-01
#7 2 7 20190412 2019-02-01 2019-06-01
#8 2 8 20190511 2019-02-01 2019-06-01
#9 2 9 20190601 2019-02-01 2019-06-01
df <- structure(list(ind_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
gift_id = 1:9, gift_date = c(20190101L, 20190102L, 20190103L,
20190104L, 20190201L, 20190319L, 20190412L, 20190511L, 20190601L
)), class = "data.frame", row.names = c(NA, -9L))
Upvotes: 0
Reputation: 1364
Below is a(nother) data.table solution.
Data
df = data.table(
id = sample(LETTERS[1:10], 100, replace = T),
gift_date = as.Date(sapply(1:100, function(x) as.Date('20000101', '%Y%m%d') + x))
)
Solution
mat = df[, .(first = min(gift_date), last = max(gift_date)), id]
> mat
id first last
1: E 2000-01-02 2000-04-06
2: G 2000-01-03 2000-04-03
3: F 2000-01-04 2000-04-07
4: A 2000-01-05 2000-04-04
5: H 2000-01-06 2000-04-10
6: C 2000-01-07 2000-03-26
7: I 2000-01-09 2000-03-27
8: D 2000-01-12 2000-04-01
9: J 2000-01-18 2000-03-05
10: B 2000-01-20 2000-04-08
Upvotes: 1
Reputation: 389265
Convert your gift_date
to actual date object then it would be easy to do this by group and select max and min values. Using dplyr
library(dplyr)
df %>%
mutate(gift_date = as.Date(as.character(gift_date), "%Y%m%d")) %>%
group_by(ind_id) %>%
mutate(min = min(gift_date),
max = max(gift_date))
# ind_id gift_id gift_date min max
# <int> <int> <date> <date> <date>
#1 1 1 2019-01-01 2019-01-01 2019-01-04
#2 1 2 2019-01-02 2019-01-01 2019-01-04
#3 1 3 2019-01-03 2019-01-01 2019-01-04
#4 1 4 2019-01-04 2019-01-01 2019-01-04
#5 2 5 2019-02-01 2019-02-01 2019-06-01
#6 2 6 2019-03-19 2019-02-01 2019-06-01
#7 2 7 2019-04-12 2019-02-01 2019-06-01
#8 2 8 2019-05-11 2019-02-01 2019-06-01
#9 2 9 2019-06-01 2019-02-01 2019-06-01
Or using data.table
might be more efficient.
library(data.table)
df$gift_date <- as.Date(as.character(df$gift_date), "%Y%m%d")
setDT(df)[, `:=` (min = min(gift_date), max = max(gift_date)), by = ind_id]
data
df <- structure(list(ind_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
gift_id = 1:9, gift_date = c(20190101L, 20190102L, 20190103L,
20190104L, 20190201L, 20190319L, 20190412L, 20190511L, 20190601L
)), class = "data.frame", row.names = c(NA, -9L))
Upvotes: 4