Reputation: 13
The problem
I have a dataset with 3 variables : an ID variable, a TIME variable and a numeric variable X that is presented in the dataset only if it is different from zero such as in the following table.
time ID X
238 2007 A 28
239 2008 A 80
240 2014 A 178
241 2012 B 88
242 2011 C 369
243 2003 D 28
244 2004 D 80
I would like to have in a unique dataframe with a lign for each ID and each year between 2001 and 2016 with X=0 when necesary. So it would be a table such as the following one :
time ID X
1 2001 A 0
2 2002 A 0
7 2007 A 28
8 2008 A 80
14 2014 A 178
17 2001 B 0
7 2012 B 88
I have found no way to do it in the R documentation or in this forum
How I proceed
To solve this problem, i had the idea to proceed in three steps:
1) I keep only one line for each ID number, no matter what year it is
data2 = data%>%group_by(ID,X)%>%distinct(.keep_all = T)
That lead to the following table:
time ID X
238 2007 A 28
241 2012 B 88
242 2011 C 369
243 2003 D 28
2) Then I duplicate each line to have observation for every year
timebis = seq(from = 2001, to = 2016, by = 1)
dupl.data2 = data2[rep(1:nrow(data1), each=length(timebis)), ]
dupl.data1$X = 0
I have now 16 observations for each ID.
3) Fill the dupl.data1$X column with a double for loop
i=1
j=1
for(i in 1:length(dupl.data2$ID)){
for(j in 1:length(data$ID)){
if (dupl.data2$timebis[i]==data$time[j] & dupl.data2$ID[i]==data$ID[j])
{dupl.data2$X[i]=data$X[j]}
j=j+1
}
j=1
i=i+1
}
Conclusion
It works well on small sub-samples but my original database have about 300 000 observations and the dataset with zeros is much larger. I would need to improve my code efficiency or ideas to solve this issue.
Thanks
Upvotes: 1
Views: 1116
Reputation: 6020
expand
to get all combinations of ID and year. Then left_join
the result on the dataframe again to get the x values.
require(dplyr)
require(tidyr)
df <- data.frame(time = sample(2001:2012,12,replace = F),
ID = sample(LETTERS[1:3],12, replace =T),
x = sample(10:50,12))
df %>%
expand(time,ID) %>%
left_join(df, c('time','ID')) %>%
mutate(x = ifelse(is.na(x),0,x))
result:
# A tibble: 36 × 3
time ID x
<int> <fctr> <dbl>
2001 A 0
2001 B 39
2001 C 0
2002 A 0
2002 B 0
2002 C 24
2003 A 35
2003 B 0
2003 C 0
2004 A 47
# ... with 26 more rows
If not all the years are in your dataframe for some reason, you can use expand with custom values.
df %>%
expand(time = 2001:2012,ID) %>%
left_join(df, c('time','ID')) %>%
mutate(x = ifelse(is.na(x),0,x))
Upvotes: 2