Erwan
Erwan

Reputation: 13

R: For loop taking too long

The problem

I have a dataset with 3 variables : an ID variable, a TIME variable and a numeric variable X that is presented in the dataset only if it is different from zero such as in the following table.

    time  ID  X
238 2007  A   28
239 2008  A   80
240 2014  A   178
241 2012  B   88
242 2011  C   369
243 2003  D   28
244 2004  D   80

I would like to have in a unique dataframe with a lign for each ID and each year between 2001 and 2016 with X=0 when necesary. So it would be a table such as the following one :

   time  ID  X
1  2001  A   0
2  2002  A   0

7  2007  A   28
8  2008  A   80

14 2014  A   178
17 2001  B   0
7  2012  B   88

I have found no way to do it in the R documentation or in this forum

How I proceed

To solve this problem, i had the idea to proceed in three steps:

1) I keep only one line for each ID number, no matter what year it is

data2 = data%>%group_by(ID,X)%>%distinct(.keep_all = T)

That lead to the following table:

    time  ID  X
238 2007  A   28
241 2012  B   88
242 2011  C   369
243 2003  D   28

2) Then I duplicate each line to have observation for every year

timebis = seq(from = 2001, to = 2016, by = 1)
dupl.data2 = data2[rep(1:nrow(data1), each=length(timebis)), ]
dupl.data1$X = 0

I have now 16 observations for each ID.

3) Fill the dupl.data1$X column with a double for loop

i=1
j=1 
for(i in 1:length(dupl.data2$ID)){
  for(j in 1:length(data$ID)){
    if (dupl.data2$timebis[i]==data$time[j] & dupl.data2$ID[i]==data$ID[j]) 
       {dupl.data2$X[i]=data$X[j]} 
  j=j+1  
}
  j=1 
  i=i+1
}

Conclusion

It works well on small sub-samples but my original database have about 300 000 observations and the dataset with zeros is much larger. I would need to improve my code efficiency or ideas to solve this issue.

Thanks

Upvotes: 1

Views: 1116

Answers (1)

Wietze314
Wietze314

Reputation: 6020

expand to get all combinations of ID and year. Then left_join the result on the dataframe again to get the x values.

require(dplyr)
require(tidyr) 
df <- data.frame(time = sample(2001:2012,12,replace = F),
                ID = sample(LETTERS[1:3],12, replace =T),
                x = sample(10:50,12))


df %>% 
  expand(time,ID) %>% 
  left_join(df, c('time','ID')) %>% 
  mutate(x = ifelse(is.na(x),0,x))

result:

# A tibble: 36 × 3
time     ID     x
<int> <fctr> <dbl>
2001      A     0
2001      B    39
2001      C     0
2002      A     0
2002      B     0
2002      C    24
2003      A    35
2003      B     0
2003      C     0
2004      A    47
# ... with 26 more rows

If not all the years are in your dataframe for some reason, you can use expand with custom values.

df %>% 
  expand(time = 2001:2012,ID) %>% 
  left_join(df, c('time','ID')) %>% 
  mutate(x = ifelse(is.na(x),0,x))

Upvotes: 2

Related Questions