MoRA

Reputation: 45

Replacing missing values by the most frequent one based on the values of two other variables

I am working in R with a dataset of three variables, say A, B and C. Variable C has some NA observations that I want to replace with the most frequent value of C among rows with the same A and B values. For example, in the following dataset:

   A B  C
1  1 2  0
2  2 1  1
3  1 1  1
4  3 1  1
5  1 2  0
6  1 2  0
7  2 3  0
8  1 2  1
9  3 3  0
10 1 2 NA

Here, I would like to replace NA with 0, since 0 is the most frequent value of C when A=1 and B=2.
I know this can be done by writing a function that computes the frequencies and the corresponding values, but I was wondering whether there is a less complicated way.

Upvotes: 2

Views: 1156

Answers (2)

lebatsnok

Reputation: 6459

base R

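Split the data by each combination of A and B, fill the NAs in C with the group's most frequent value, and reassemble the rows in their original order. The names of a table() result are character, so the value is converted back to integer to keep the type of C (this assumes C is an integer column, as in the sample data).

g <- interaction(df$A, df$B, drop = TRUE)
unsplit(lapply(split(df, g), function(d) {
  # table() ignores NA by default; which.max() picks the most frequent value
  d$C[is.na(d$C)] <- as.integer(names(which.max(table(d$C))))
  d
}), g)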

# output
   A B C
1  1 2 0
2  2 1 1
3  1 1 1
4  3 1 1
5  1 2 0
6  1 2 0
7  2 3 0
8  1 2 1
9  3 3 0
10 1 2 0
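
As a hedged alternative, the same group-wise fill can be sketched with ave(), which applies a function within each A/B combination and returns the values in the original row order. The helper name fill_with_mode is hypothetical (not from any package), and the choice that ties go to the value appearing first in the group is an assumption of this sketch.

```r
# Sample data from the question
df <- data.frame(
  A = c(1, 2, 1, 3, 1, 1, 2, 1, 3, 1),
  B = c(2, 1, 1, 1, 2, 2, 3, 2, 3, 2),
  C = c(0, 1, 1, 1, 0, 0, 0, 1, 0, NA)
)

# Hypothetical helper: replace NAs in x with the most frequent non-NA value.
# Ties are resolved in favour of the value that appears first (an assumption).
fill_with_mode <- function(x) {
  known <- x[!is.na(x)]
  if (length(known) == 0) return(x)  # nothing to learn from: leave unchanged
  values <- unique(known)
  mode_value <- values[which.max(tabulate(match(known, values)))]
  x[is.na(x)] <- mode_value
  x
}

# ave() applies the helper within each combination of A and B
df$C <- ave(df$C, df$A, df$B, FUN = fill_with_mode)
df
```

Because the helper indexes into the original values rather than into table() names, the column keeps its type, and groups that contain only NAs are left untouched instead of raising an error.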

Upvotes: 0

Maurits Evers

Reputation: 50678

A tidyverse option

library(tidyverse)
df %>%
    group_by(A, B) %>%
    add_count(C) %>%
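    # zero out the count of the NA rows so that NA itself can never be picked as the mode
    mutate(C = if_else(is.na(C), C[which.max(replace(n, is.na(C), 0L))], C)) %>%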
    select(-n) %>%
    ungroup()
# A tibble: 10 x 3
       A     B     C
   <int> <int> <int>
 1     1     2     0
 2     2     1     1
 3     1     1     1
 4     3     1     1
 5     1     2     0
 6     1     2     0
 7     2     3     0
 8     1     2     1
 9     3     3     0
10     1     2     0

Explanation: Group entries by A and B, add a count n for every value of C within each group, replace NA values in C with the group's most frequent non-NA value, then drop the helper column n and ungroup to reproduce the expected output.


Sample data

df <- read.table(text =
    "   A B  C
1  1 2  0
2  2 1  1
3  1 1  1
4  3 1  1
5  1 2  0
6  1 2  0
7  2 3  0
8  1 2  1
9  3 3  0
10 1 2 NA
", header = TRUE)

Upvotes: 1
