Reputation: 1644
I would like to perform imputations on a variable such that it is constant within each id
. That is var1
can only have one distinct value for each and every id
.
One of it is to perform imputations given the number of occurrences of values of an existing variable as well as some other conditions.
df_old<- read.table(header = TRUE, text = "
date id var1
25/01/2016 1 A
02/05/2016 1 A
20/03/2016 1 B
13/07/2016 1 NA
20/03/2016 2 B
28/04/2016 2 C
20/03/2016 3 B
28/04/2016 3 OTHERS
28/04/2016 3 OTHERS
20/10/2016 4 NA
28/11/2016 4 NA
",stringsAsFactors=FALSE)
We focus on var1
here. Within each id
:
var1
has either A
, B
or C
, then we impute var1
with the value that occurs the most. e.g. for id=1
, it has two distinct var1
values (A
and B
), hence we impute it with A
since it occurs the most. Impute the missing value with A
also.var1
has either A
, B
or C
, and the two distinct var1
values have the same number of occurrences, then we impute the var1
value with the one that occurs the latest. For id=2
, it has one B
and one C
. We impute it with C
since the date corresponding to it occurs the lastest.var1
contains two distinct values one of it is either A
/B
/C
and the other is OTHERS
, we always impute using A
/B
/C
. Hence, for id=3
we impute using B
instead of OTHERS
.var1
contains only OTHERS
, then it will remain as OTHERS
.id=4
where it does not have any var1
values, they will remain as missing values.Hence, new dataframe should look like:
df_new<- read.table(header = TRUE, text = "
date id var1
25/01/2016 1 A
02/05/2016 1 A
20/03/2016 1 A
13/07/2016 1 A
20/03/2016 2 C
28/04/2016 2 C
20/03/2016 3 B
28/04/2016 3 B
28/04/2016 3 B
20/10/2016 4 NA
28/11/2016 4 NA
",stringsAsFactors=FALSE)`
I have tried to write out the first case, but I can't seem to solve it and hence cannot move on with the other cases.
library(dplyr)
df_old %>%
group_by(id,var1) %>%
mutate(n=n()) %>%
group_by(id) %>%
mutate(var1=if_else(n==min(n),var1[max(n)],var1[max(n)]))
or replace the last line with
replace(var1,which(min(n)),var1[which(max(n))]))
Both give errors:
Error: invalid 'type' (closure) of argument"
How do I rectify this?
Upvotes: 2
Views: 90
Reputation: 2496
you can use tidyverse
functions as:
dft %>%
group_by(id, var1) %>%
mutate(var2 = n()) %>%
mutate(var2 = if_else(var1 == "OTHERS", as.integer(0), as.integer(var2))) %>%
mutate( var2 = if_else( is.na(var2), as.integer(-1), as.integer(var2))) %>%
ungroup() %>%
group_by(id) %>%
arrange(desc(date)) %>%
mutate(var3 = var1[which.max(var2)]) %>%
arrange(id) %>%
select(date, id, var3)
which gives:
Source: local data frame [11 x 3]
Groups: id [4]
date id var3
<chr> <int> <chr>
1 25/01/2016 1 A
2 20/03/2016 1 A
3 13/07/2016 1 A
4 02/05/2016 1 A
5 28/04/2016 2 C
6 20/03/2016 2 C
7 28/04/2016 3 B
8 28/04/2016 3 B
9 20/03/2016 3 B
10 28/11/2016 4 <NA>
11 20/10/2016 4 <NA>
We used 0 and -1 to account for frequencies of OTHERS
and NA
. If you prefer a different priority order, change the values accordingly.
Upvotes: 2