Replacing missing values by the most frequent one based on the values of other two variables

Question

I am working in R with a dataset that consists of three variables: A, B and C. Variable C has some NA observations, which I want to replace with the most frequent value of C among the rows that share the same A and B values. As an example, take a dataset in which one row with A=1 and B=2 has C missing (NA).

Here, I would like to replace the NA with 0, since 0 is the most frequent value of C when A=1 and B=2.
I know this can be done by writing a function that computes the frequencies and the corresponding values, but I was wondering whether there is a less complicated way.

Maurits Evers · Accepted Answer

A tidyverse option

library(tidyverse)
df %>%
    group_by(A, B) %>%
    add_count(C) %>%
    mutate(C = if_else(is.na(C), C[which.max(n)], C)) %>%
    select(-n) %>%
    ungroup()
# A tibble: 10 x 3
       A     B     C
   <int> <int> <int>
 1     1     2     0
 2     2     1     1
 3     1     1     1
 4     3     1     1
 5     1     2     0
 6     1     2     0
 7     2     3     0
 8     1     2     1
 9     3     3     0
10     1     2     0

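Explanation: Group the entries by A and B, add a count n for every value of C within each group, replace NA values in C with the value of C that has the highest count, and tidy up the tibble to reproduce the expected output. Note that add_count() counts NA as a value of its own, so this relies on NA not being the most frequent entry of C in any group.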
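If you would rather avoid extra packages, the same idea can be sketched in base R with ave(). This is only a minimal sketch under a few assumptions: the helper name fill_with_mode is hypothetical (it is not part of any package), it counts non-NA values only so that NA can never be chosen as the replacement, ties go to the value that appears first in the group, and a group that contains nothing but NA is left unchanged.

```r
# Sample data: C has one NA in the A = 1, B = 2 group
df <- read.table(text = "
A B C
1 2 0
2 1 1
1 1 1
3 1 1
1 2 0
1 2 0
2 3 0
1 2 1
3 3 0
1 2 NA
", header = TRUE)

# Hypothetical helper: replace NAs in x with the most frequent non-NA value of x
fill_with_mode <- function(x) {
  vals <- unique(x[!is.na(x)])           # distinct non-NA values
  if (length(vals) == 0) return(x)       # empty or all-NA group: nothing to fill with
  counts <- tabulate(match(x, vals))     # tabulate() ignores the NAs from match()
  x[is.na(x)] <- vals[which.max(counts)] # on ties, the first value seen wins
  x
}

# ave() applies the helper within each (A, B) group and keeps the row order
df$C <- ave(df$C, df$A, df$B, FUN = fill_with_mode)
df
```

Because the helper works on a plain vector, it should also be usable inside a grouped dplyr mutate(), e.g. mutate(C = fill_with_mode(C)) after group_by(A, B).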

Sample data

df <- read.table(text =
    "   A B  C
1  1 2  0
2  2 1  1
3  1 1  1
4  3 1  1
5  1 2  0
6  1 2  0
7  2 3  0
8  1 2  1
9  3 3  0
10 1 2 NA
", header = TRUE)
