Reputation: 342

R loop too long

I am dealing with a 2M rows DB and my if loop is taking too long.

The data base has 3 variables : Gift_ID, ind_id and gift_date with data and two empty variables which I want to fill in min and max. For every different id (ind_id) I want to identify the first gift date and the last gift date. I've written these two loops that worked (the loops work if the DB is ordered by ind_ID and then gift_date), but when I run them with the hole DB it takes too long. Any ideas of another way of writing this?

for (i in 2:length(NewGifts[,1])){
  if(NewGifts$ind_id[i] != NewGifts$ind_id[i-1]){
NewGifts$min[i] = format.Date(NewGifts$gift_date[i], '%Y%m%d')
  } else {
    NewGifts$min[i] = NewGifts$min[i-1]
  }
}


for (i in ((length(NewGifts[,1])-1):1)){
  if(NewGifts$ind_id[i] != NewGifts$ind_id[i+1]){
    NewGifts$max[i] = format.Date(NewGifts$gift_date[i], '%Y%m%d')
  } else {
NewGifts$max[i] = NewGifts$max[i+1]
  }
}

Maybe working with data.table is more efficient, but I couldn't find any post helping.

Sample of the data:

ind_id   gift_id   gift_date  
1        1         20190101 
1        2         20190102 
1        3         20190103 
1        4         20190104 
2        5         20190201 
2        6         20190319 
2        7         20190412 
2        8         20190511 
2        9         20190601

Expected result :

ind_id   gift_id   gift_date    min         max  
1        1         20190101     20190101    20190104
1        2         20190102     20190101    20190104
1        3         20190103     20190101    20190104
1        4         20190104     20190101    20190104
2        5         20190201     20190201    20190601 
2        6         20190319     20190201    20190601 
2        7         20190412     20190201    20190601 
2        8         20190511     20190201    20190601 
2        9         20190601     20190201    20190601

Upvotes: 2

Answers (3)

akrun

Reputation: 887881

We can use also do this in tidyverse with

library(tidyr)
library(dplyr)
library(lubridate)
df %>%
   group_by(ind_id) %>%
   mutate(minmax = list(set_names(range(ymd(gift_date)), c('min', 'max')))) %>% 
   unnest_wider(minmax)
# A tibble: 9 x 5
# Groups:   ind_id [2]
#  ind_id gift_id gift_date min        max       
#   <int>   <int>     <int> <date>     <date>    
#1      1       1  20190101 2019-01-01 2019-01-04
#2      1       2  20190102 2019-01-01 2019-01-04
#3      1       3  20190103 2019-01-01 2019-01-04
#4      1       4  20190104 2019-01-01 2019-01-04
#5      2       5  20190201 2019-02-01 2019-06-01
#6      2       6  20190319 2019-02-01 2019-06-01
#7      2       7  20190412 2019-02-01 2019-06-01
#8      2       8  20190511 2019-02-01 2019-06-01
#9      2       9  20190601 2019-02-01 2019-06-01

data

df <- structure(list(ind_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), 
gift_id = 1:9, gift_date = c(20190101L, 20190102L, 20190103L, 
20190104L, 20190201L, 20190319L, 20190412L, 20190511L, 20190601L
)), class = "data.frame", row.names = c(NA, -9L))

Upvotes: 0

JDG

Reputation: 1364

Below is a(nother) data.table solution.

Data

df = data.table(
  id = sample(LETTERS[1:10], 100, replace = T),
  gift_date = as.Date(sapply(1:100, function(x) as.Date('20000101', '%Y%m%d') + x))
)

Solution

mat = df[, .(first = min(gift_date), last = max(gift_date)), id]
> mat
    id      first       last
 1:  E 2000-01-02 2000-04-06
 2:  G 2000-01-03 2000-04-03
 3:  F 2000-01-04 2000-04-07
 4:  A 2000-01-05 2000-04-04
 5:  H 2000-01-06 2000-04-10
 6:  C 2000-01-07 2000-03-26
 7:  I 2000-01-09 2000-03-27
 8:  D 2000-01-12 2000-04-01
 9:  J 2000-01-18 2000-03-05
10:  B 2000-01-20 2000-04-08

Upvotes: 1

Ronak Shah

Reputation: 389265

Convert your gift_date to actual date object then it would be easy to do this by group and select max and min values. Using dplyr

library(dplyr)
df %>%
  mutate(gift_date = as.Date(as.character(gift_date), "%Y%m%d")) %>%
  group_by(ind_id) %>%
  mutate(min = min(gift_date), 
         max = max(gift_date))


#  ind_id gift_id gift_date  min        max       
#   <int>   <int> <date>     <date>     <date>    
#1      1       1 2019-01-01 2019-01-01 2019-01-04
#2      1       2 2019-01-02 2019-01-01 2019-01-04
#3      1       3 2019-01-03 2019-01-01 2019-01-04
#4      1       4 2019-01-04 2019-01-01 2019-01-04
#5      2       5 2019-02-01 2019-02-01 2019-06-01
#6      2       6 2019-03-19 2019-02-01 2019-06-01
#7      2       7 2019-04-12 2019-02-01 2019-06-01
#8      2       8 2019-05-11 2019-02-01 2019-06-01
#9      2       9 2019-06-01 2019-02-01 2019-06-01

Or using data.table might be more efficient.

library(data.table)
df$gift_date <- as.Date(as.character(df$gift_date), "%Y%m%d")
setDT(df)[, `:=` (min = min(gift_date), max = max(gift_date)), by = ind_id]

data

df <- structure(list(ind_id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), 
gift_id = 1:9, gift_date = c(20190101L, 20190102L, 20190103L, 
20190104L, 20190201L, 20190319L, 20190412L, 20190511L, 20190601L
)), class = "data.frame", row.names = c(NA, -9L))

Upvotes: 4

R loop too long

Answers (3)

data

Related Questions