Reputation: 45
I am working with R and have a dataset comprising three variables, e.g. A, B and C. Variable C has some NA observations, which I wish to replace with the most frequent value of C among the rows with the same A and B values. As an example, in the following dataset:
A B C
1 1 2 0
2 2 1 1
3 1 1 1
4 3 1 1
5 1 2 0
6 1 2 0
7 2 3 0
8 1 2 1
9 3 3 0
10 1 2 NA
Here, I would like to replace NA by 0, since it is the most frequent value of C when A=1 and B=2.
I know it can be done by writing a function that obtains the frequencies and the corresponding values; however, I was wondering whether there is a less complicated way.
Upvotes: 2
Views: 1156
Reputation: 6459
(Sorry for the very long line.)
unsplit(lapply(split(df, list(df$A, df$B), drop=TRUE), function(.) {.$C[is.na(.$C)] <- as.integer(names(which.max(table(.$C))));.}), interaction(df$A, df$B, drop = TRUE))
# output
A B C
1 1 2 0
2 2 1 1
3 1 1 1
4 3 1 1
5 1 2 0
6 1 2 0
7 2 3 0
8 1 2 1
9 3 3 0
10 1 2 0
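The same idea can also be written without split/unsplit. What follows is only a minimal sketch, assuming C is an integer column as in the example data; most_frequent is a hypothetical helper introduced here, not a base R function. ave() applies it within each A/B group and returns the values in the original row order.

```r
# Example data from the question
df <- data.frame(
  A = c(1L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 3L, 1L),
  B = c(2L, 1L, 1L, 1L, 2L, 2L, 3L, 2L, 3L, 2L),
  C = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, NA)
)

# Hypothetical helper: fill NAs with the most frequent non-NA value of x.
# Ties go to the value that appears first; a group with no non-NA
# values (or an empty group) is returned unchanged.
most_frequent <- function(x) {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 0) return(x)
  # match() yields NA for missing entries, and tabulate() ignores NA
  counts <- tabulate(match(x, vals), nbins = length(vals))
  x[is.na(x)] <- vals[which.max(counts)]
  x
}

# Apply the helper within each combination of A and B
df$C <- ave(df$C, df$A, df$B, FUN = most_frequent)
```

Because the replacement is taken from the observed values themselves rather than from the names of a table, C keeps its original type.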
Upvotes: 0
Reputation: 50678
A tidyverse option:
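library(tidyverse)
df %>%
  group_by(A, B) %>%
  add_count(C) %>%
  # zero out the counts of NA rows so that NA itself is never picked as the most frequent value
  mutate(C = if_else(is.na(C), C[which.max(n * !is.na(C))], C)) %>%
  select(-n) %>%
  ungroup()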
# A tibble: 10 x 3
A B C
<int> <int> <int>
1 1 2 0
2 2 1 1
3 1 1 1
4 3 1 1
5 1 2 0
6 1 2 0
7 2 3 0
8 1 2 1
9 3 3 0
10 1 2 0
Explanation: Group entries by A and B, add a count for every C, replace NA values in C with the most frequent non-NA value in C, and tidy up the tibble to reproduce the expected output.
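To make the counting step easier to inspect, the steps described above can also be spelled out as an explicit lookup table. This is a sketch under assumptions rather than a definitive implementation: the names freq, modes, filled and C_mode are made up for illustration, and slice_max() requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  A = c(1L, 2L, 1L, 3L, 1L, 1L, 2L, 1L, 3L, 1L),
  B = c(2L, 1L, 1L, 1L, 2L, 2L, 3L, 2L, 3L, 2L),
  C = c(0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, NA)
)

# One row per (A, B, C) combination with its frequency n;
# NA is counted as a value of its own
freq <- count(df, A, B, C)

# For each A/B group keep the most frequent non-NA value of C
# (with_ties = FALSE keeps a single row when counts are tied)
modes <- freq %>%
  filter(!is.na(C)) %>%
  group_by(A, B) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(A, B, C_mode = C)

# Fill the missing values from the lookup table
filled <- df %>%
  left_join(modes, by = c("A", "B")) %>%
  mutate(C = coalesce(C, C_mode)) %>%
  select(-C_mode)
```

Printing freq shows why the NA rows need care: NA gets its own count, so a group in which NA is the most common entry would otherwise be "filled" with NA again.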
# Sample data
df <- read.table(text =
" A B C
1 1 2 0
2 2 1 1
3 1 1 1
4 3 1 1
5 1 2 0
6 1 2 0
7 2 3 0
8 1 2 1
9 3 3 0
10 1 2 NA
", header = TRUE)
Upvotes: 1