HNSKD
HNSKD

Reputation: 1644

How to perform imputations on a variable given the number of occurrences of values for the same variable?

I would like to perform imputations on a variable such that it is constant within each id. That is var1 can only have one distinct value for each and every id.

One of it is to perform imputations given the number of occurrences of values of an existing variable as well as some other conditions.

df_old<- read.table(header = TRUE, text = "
date       id  var1 
25/01/2016  1  A 
02/05/2016  1  A 
20/03/2016  1  B 
13/07/2016  1  NA
20/03/2016  2  B 
28/04/2016  2  C
20/03/2016  3  B 
28/04/2016  3  OTHERS
28/04/2016  3  OTHERS 
20/10/2016  4  NA 
28/11/2016  4  NA
",stringsAsFactors=FALSE)

We focus on var1 here. Within each id:

  1. If var1 has either A, B or C, then we impute var1 with the value that occurs the most. e.g. for id=1, it has two distinct var1 values (A and B), hence we impute it with A since it occurs the most. Impute the missing value with A also.
  2. If var1 has either A, B or C, and the two distinct var1 values have the same number of occurrences, then we impute the var1 value with the one that occurs the latest. For id=2, it has one B and one C. We impute it with C since the date corresponding to it occurs the lastest.
  3. If var1 contains two distinct values one of it is either A/B/C and the other is OTHERS, we always impute using A/B/C. Hence, for id=3 we impute using B instead of OTHERS.
  4. If var1 contains only OTHERS, then it will remain as OTHERS.
  5. For id=4 where it does not have any var1 values, they will remain as missing values.

Hence, new dataframe should look like:

df_new<- read.table(header = TRUE, text = "
date       id  var1 
25/01/2016  1  A 
02/05/2016  1  A 
20/03/2016  1  A 
13/07/2016  1  A
20/03/2016  2  C 
28/04/2016  2  C
20/03/2016  3  B 
28/04/2016  3  B
28/04/2016  3  B 
20/10/2016  4  NA 
28/11/2016  4  NA
",stringsAsFactors=FALSE)`

I have tried to write out the first case, but I can't seem to solve it and hence cannot move on with the other cases.

library(dplyr)
df_old %>%
group_by(id,var1) %>%
mutate(n=n()) %>%
group_by(id) %>%
mutate(var1=if_else(n==min(n),var1[max(n)],var1[max(n)]))

or replace the last line with replace(var1,which(min(n)),var1[which(max(n))]))

Both give errors:

Error: invalid 'type' (closure) of argument"

How do I rectify this?

Upvotes: 2

Views: 90

Answers (1)

Aramis7d
Aramis7d

Reputation: 2496

you can use tidyverse functions as:

dft %>% 
      group_by(id, var1) %>%
      mutate(var2 = n()) %>%
      mutate(var2 = if_else(var1 == "OTHERS", as.integer(0), as.integer(var2))) %>%
      mutate( var2 = if_else( is.na(var2), as.integer(-1), as.integer(var2))) %>%
      ungroup() %>%
      group_by(id) %>%
      arrange(desc(date)) %>%
      mutate(var3 = var1[which.max(var2)]) %>%
      arrange(id) %>%
      select(date, id, var3)

which gives:

Source: local data frame [11 x 3]
Groups: id [4]

         date    id  var3
        <chr> <int> <chr>
1  25/01/2016     1     A
2  20/03/2016     1     A
3  13/07/2016     1     A
4  02/05/2016     1     A
5  28/04/2016     2     C
6  20/03/2016     2     C
7  28/04/2016     3     B
8  28/04/2016     3     B
9  20/03/2016     3     B
10 28/11/2016     4  <NA>
11 20/10/2016     4  <NA>

We used 0 and -1 to account for frequencies of OTHERS and NA. If you prefer a different priority order, change the values accordingly.

Upvotes: 2

Related Questions