Reputation: 130
I have a data frame in which every SerialNo has a row of missing values, and some also have rows with filled data. I need to expand each NA row into two new rows, one for each unique value of PartsUsed.
I believe I'm on the right track using dplyr complete or expand, but I can't seem to get the arguments right.
Below are example datasets (the one I currently have and the target df):
library(tibble)
begin_ex_df <- tibble(
  SerialNo = c(1234, 5678, 5678, 5678, 9012, 1357, 1357, 1357, 2468, 8080),
  PartsUsed = c(NA, "A", "B", NA, NA, "A", "B", NA, NA, NA),
  Values = c(NA, 10, 15, NA, NA, 11, 14, NA, NA, NA)
)
ending_ex_df <- tibble(
  SerialNo = c(1234, 1234, 5678, 5678, 5678, 5678, 9012, 9012, 1357, 1357, 1357, 1357, 2468, 2468, 8080, 8080),
  PartsUsed = c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B"),
  Values = c(0, 0, 10, 15, 0, 0, 0, 0, 11, 14, 0, 0, 0, 0, 0, 0)
)
I expect every NA row to be expanded into an "A" row and a "B" row in PartsUsed, with its Values filled with 0. (Filling the value is not necessary; it can remain NA.) Rows that already have an "A" or "B" should be left alone.
Upvotes: 3
Views: 120
Reputation: 2987
Here is an alternative in base R, which keeps the duplicated SerialNo values:
x <- subset(begin_ex_df, is.na(begin_ex_df$PartsUsed))
x <- expand.grid(SerialNo = x$SerialNo, PartsUsed = c("A", "B"))
x$Values <- 0
y <- subset(begin_ex_df, !is.na(begin_ex_df$PartsUsed))
new_df <- rbind(x, y)
new_df[order(new_df$SerialNo),]
# SerialNo PartsUsed Values
#1 1234 A 0
#2 1234 B 0
#3 1357 A 0
#4 1357 B 0
#5 1357 A 11
#6 1357 B 14
#7 2468 A 0
#8 2468 B 0
#9 5678 A 0
#10 5678 B 0
#11 5678 A 10
#12 5678 B 15
#13 8080 A 0
#14 8080 B 0
#15 9012 A 0
#16 9012 B 0
If you want to use tidyr complete, you can group_by while creating an index that separates the groups with NA from those without, and then drop the NA rows at the end (you can also remove the Ind column).
library(dplyr)
library(tidyr)
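begin_ex_df %>%
  # Ind separates the NA rows from the filled rows within each SerialNo
  group_by(SerialNo, Ind = if_else(is.na(PartsUsed), "no", "yes")) %>%
  # SerialNo is already a grouping column, so only PartsUsed is completed;
  # recent tidyr versions reject completing on a grouping column
  complete(PartsUsed = c("A", "B"), fill = list(Values = 0)) %>%
  na.omit()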
# Ind SerialNo PartsUsed Values
# <chr> <dbl> <chr> <dbl>
# 1 no 1234 A 0
# 2 no 1234 B 0
# 3 no 1357 A 0
# 4 no 1357 B 0
# 5 yes 1357 A 11
# 6 yes 1357 B 14
# 7 no 2468 A 0
# 8 no 2468 B 0
# 9 no 5678 A 0
#10 no 5678 B 0
#11 yes 5678 A 10
#12 yes 5678 B 15
#13 no 8080 A 0
#14 no 8080 B 0
#15 no 9012 A 0
#16 no 9012 B 0
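The parenthetical about removing the Ind column is not shown in code, so here is a minimal sketch of one way to do it, assuming the question's begin_ex_df. The name result is only illustrative, and filter(!is.na(PartsUsed)) is swapped in for na.omit() so that only the leftover NA placeholder rows are dropped.

```r
library(dplyr)
library(tidyr)

# Example input copied from the question
begin_ex_df <- tibble(
  SerialNo = c(1234, 5678, 5678, 5678, 9012, 1357, 1357, 1357, 2468, 8080),
  PartsUsed = c(NA, "A", "B", NA, NA, "A", "B", NA, NA, NA),
  Values = c(NA, 10, 15, NA, NA, 11, 14, NA, NA, NA)
)

# `result` is a hypothetical name used only for this sketch
result <- begin_ex_df %>%
  # Ind marks whether a row is an NA placeholder ("no") or filled ("yes")
  group_by(SerialNo, Ind = if_else(is.na(PartsUsed), "no", "yes")) %>%
  # Within each group, make sure both "A" and "B" are present
  complete(PartsUsed = c("A", "B"), fill = list(Values = 0)) %>%
  # Drop the original NA placeholder rows
  filter(!is.na(PartsUsed)) %>%
  # Remove the grouping and the helper column
  ungroup() %>%
  select(-Ind)

print(result)
```

Calling ungroup() before select() matters here, because dplyr keeps grouping columns in the output of select() on a grouped data frame.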
Upvotes: 3
Reputation: 7592
complete is from tidyr, not dplyr, but it's certainly a good direction to take. It does need to be combined with filter from dplyr to remove the NA rows created.
library(tidyr)  # attached so that %>% is available
complete(begin_ex_df, SerialNo, PartsUsed, fill = list(Values = 0)) %>%
  dplyr::filter(!is.na(PartsUsed))
# A tibble: 12 x 3
SerialNo PartsUsed Values
<dbl> <chr> <dbl>
1 1234 A 0
2 1234 B 0
3 1357 A 11
4 1357 B 14
5 2468 A 0
6 2468 B 0
7 5678 A 10
8 5678 B 15
9 8080 A 0
10 8080 B 0
11 9012 A 0
12 9012 B 0
Upvotes: 2