Reputation: 903
I have a longitudinal data frame prueba
that follows different units (variable LA
) over time (variables time
and year
). The first 25 observations have the following structure.
> head(prueba, 25)
LA month year entry exit total homes
1 Barking and Dagenham 10 2010 2 0 2 NA
2 Barking and Dagenham 11 2010 3 0 3 NA
3 Barking and Dagenham 12 2010 3 0 3 15
4 Barking and Dagenham 1 2011 6 0 6 NA
5 Barking and Dagenham 2 2011 1 0 1 NA
6 Barking and Dagenham 3 2011 2 0 2 NA
7 Barking and Dagenham 4 2011 1 0 1 NA
8 Barking and Dagenham 10 2011 1 0 1 NA
9 Barking and Dagenham 11 2011 1 0 1 NA
10 Barking and Dagenham 1 2012 1 0 1 NA
11 Barking and Dagenham 9 2012 1 0 1 NA
12 Barking and Dagenham 6 2013 2 0 2 NA
13 Barking and Dagenham 1 2014 0 1 -1 NA
14 Barking and Dagenham 12 2014 0 1 -1 NA
15 Barking and Dagenham 3 2015 1 1 0 NA
16 Barking and Dagenham 11 2015 1 1 0 NA
17 Barking and Dagenham 12 2015 1 0 1 NA
18 Barnet 11 2010 24 0 24 NA
19 Barnet 12 2010 28 0 28 86
20 Barnet 1 2011 28 0 28 NA
21 Barnet 2 2011 6 0 6 NA
22 Barnet 3 2011 1 0 1 NA
23 Barnet 4 2011 1 0 1 NA
24 Barnet 7 2011 2 0 2 NA
25 Barnet 8 2011 1 0 1 NA
My aim is to recode the homes
variable by assigning the values that are not missing to the observations where month == "2"
and year == "2011"
. In case there are not observations for these values of month
and year
, the relabelled observations would be those corresponding to month == "1"
and year == "2011"
. Ideally, the expected output would be something like this:
> head(prueba, 25)
LA month year entry exit total homes
1 Barking and Dagenham 10 2010 2 0 2 NA
2 Barking and Dagenham 11 2010 3 0 3 NA
3 Barking and Dagenham 12 2010 3 0 3 NA
4 Barking and Dagenham 1 2011 6 0 6 NA
5 Barking and Dagenham 2 2011 1 0 1 15
6 Barking and Dagenham 3 2011 2 0 2 NA
7 Barking and Dagenham 4 2011 1 0 1 NA
8 Barking and Dagenham 10 2011 1 0 1 NA
9 Barking and Dagenham 11 2011 1 0 1 NA
10 Barking and Dagenham 1 2012 1 0 1 NA
11 Barking and Dagenham 9 2012 1 0 1 NA
12 Barking and Dagenham 6 2013 2 0 2 NA
13 Barking and Dagenham 1 2014 0 1 -1 NA
14 Barking and Dagenham 12 2014 0 1 -1 NA
15 Barking and Dagenham 3 2015 1 1 0 NA
16 Barking and Dagenham 11 2015 1 1 0 NA
17 Barking and Dagenham 12 2015 1 0 1 NA
18 Barnet 11 2010 24 0 24 NA
19 Barnet 12 2010 28 0 28 NA
20 Barnet 1 2011 28 0 28 NA
21 Barnet 2 2011 6 0 6 86
22 Barnet 3 2011 1 0 1 NA
23 Barnet 4 2011 1 0 1 NA
24 Barnet 7 2011 2 0 2 NA
25 Barnet 8 2011 1 0 1 NA
I have approached this by using data.table
on the following basis:
test = data.table(prueba)
setkey(test, LA)
test$homes =test[, .SD[, ifelse(year == "2011" & month == "2", !is.na(homes), homes)], by=LA]
But it is not yielding the expected output.
> head(test, 25)
LA month year entry exit total homes
1: Barking and Dagenham 10 2010 2 0 2 NA
2: Barking and Dagenham 11 2010 3 0 3 NA
3: Barking and Dagenham 12 2010 3 0 3 15
4: Barking and Dagenham 1 2011 6 0 6 NA
5: Barking and Dagenham 2 2011 1 0 1 NA
6: Barking and Dagenham 3 2011 2 0 2 NA
7: Barking and Dagenham 4 2011 1 0 1 NA
8: Barking and Dagenham 10 2011 1 0 1 NA
9: Barking and Dagenham 11 2011 1 0 1 NA
10: Barking and Dagenham 1 2012 1 0 1 NA
11: Barking and Dagenham 9 2012 1 0 1 NA
12: Barking and Dagenham 6 2013 2 0 2 NA
13: Barking and Dagenham 1 2014 0 1 -1 NA
14: Barking and Dagenham 12 2014 0 1 -1 NA
15: Barking and Dagenham 3 2015 1 1 0 NA
16: Barking and Dagenham 11 2015 1 1 0 NA
17: Barking and Dagenham 12 2015 1 0 1 NA
18: Barnet 11 2010 24 0 24 NA
19: Barnet 12 2010 28 0 28 86
20: Barnet 1 2011 28 0 28 NA
21: Barnet 2 2011 6 0 6 NA
22: Barnet 3 2011 1 0 1 NA
23: Barnet 4 2011 1 0 1 NA
24: Barnet 7 2011 2 0 2 NA
25: Barnet 8 2011 1 0 1 NA
LA month year entry exit total homes
I would be grateful if someone can suggest an alternative approach to this - not necessarily wit data.table
.
Upvotes: 0
Views: 194
Reputation: 3994
library(dplyr)
dfs <- data.frame(df %>%
group_by(LA) %>%
summarise(Homes = sum(homes, na.rm = T)) %>%
inner_join(.,df, by = 'LA') %>%
mutate(Homes = ifelse(month == 2 & year == 2011, Homes, NA)))
This should do the trick, and using the dplyr
package has a lot of speed as opposed to doing it iteratively (e.g. for
or while
).
Upvotes: 1