Reputation: 39595
I am working with a dataframe in R which has some missing values across rows. Data frame is next (dput
added in the end):
df
id V1 V2 V3 V4
1 01 1 1 1 NA
2 02 2 1 NA NA
3 03 3 1 NA NA
4 04 4 1 2 NA
Each row is a different id
. As you can see the rows have missing values. I would like to know how can I get a dataframe completed in this style without using reshape to long or pivot as my real data is very large:
df
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
I was trying to use fill
from tidyr
but at row level I am having issues. I have seen some posts where it is used along with dplyr
function across
but I can not find it. I have tried using group_by(id)
and rowwise
but I have not had success. Also only the variables/columns starting with V
should be filled with previous values.
Data is next:
#Data
df <- structure(list(id = c("01", "02", "03", "04"), V1 = c(1, 2, 3,
4), V2 = c(1, 1, 1, 1), V3 = c(1, NA, NA, 2), V4 = c(NA, NA,
NA, NA)), class = "data.frame", row.names = c(NA, -4L))
Many thanks for your time.
Upvotes: 9
Views: 3727
Reputation: 21908
One solution could be using na.locf
function from package zoo
combining with purrr::pmap
function in a row-wise operation. na.locf
takes the most recent non-NA
value and replace all the upcoming NA
values by that. Just as a reminder c(...)
in both solutions captures all values of V1:V4
in each row in every iteration. However, I excluded id
column in both as it is not involved in the our calculations.
library(zoo)
library(purrr)
df %>%
mutate(pmap_df(., ~ na.locf(c(...)[-1])))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
Or we can use coalesce
function from dplyr
. We can replace every NA
values in each row with the last non-NA
value, something we did earlier with na.locf
. However this solution is a bit verbose:
df %>%
mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))];
coalesce(c(...), x[length(x)])}))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
Or you could also use this:
library(purrr)
df %>%
mutate(across(!id, ~ replace(., is.na(.), invoke(coalesce, rev(df[-1])))))
id V1 V2 V3 V4
1 01 1 1 1 1
2 02 2 1 1 1
3 03 3 1 1 1
4 04 4 1 2 2
The warning message can be ignored. It is in fact produced because we have 6 NA
values but the result of applying dplyr::coalesce
on every vector is 1 element resulting in 4 elements to replace 6 slots.
Upvotes: 10
Reputation: 269586
If the reason you want to avoid reshaping is to save runtime then that idea is actually mistaken if the benchmark below continues to hold at scale. Note that f which transposes, uses na.locf and then transposes back is the fastest.
library(microbenchmark)
library(data.table)
library(dplyr)
library(purrr)
library(zoo)
microbenchmark(times = 10,
a = df %>% mutate(pmap_df(., ~ na.locf(c(...)[-1]))),
b = df %>%
mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))];
coalesce(c(...), x[length(x)])})),
c = df %>%
mutate(across(-id, ~ ifelse(is.na(.), coalesce(!!!select(., V4:V1)), .))),
d = df %>% mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[-1][!is.na(cur_data()[-1])],1)))),
e = as.data.table(df)[, setNames(as.list(nafill(unlist(.SD), type = "locf")), names(.SD)), id],
f = data.frame(id = df$id, t(na.locf(t(df[-1])))))
giving:
Unit: milliseconds
expr min lq mean median uq max neval
a 11.343302 12.934702 15.032001 13.115151 14.799400 30.135901 10
b 11.641301 13.116401 14.030551 14.426751 15.012701 15.517501 10
c 28.201501 30.470801 33.375761 32.672950 36.671101 40.448701 10
d 25.394901 26.648801 30.044331 27.971251 32.433801 39.570600 10
e 3.750801 4.023700 8.771401 4.150701 4.367502 50.636700 10
f 2.454701 2.458201 3.009181 2.603951 2.952302 6.126101 10
Upvotes: 5
Reputation: 101335
A data.table
option with nafill
> setDT(df)[, setNames(as.list(nafill(unlist(.SD), type = "locf")), names(.SD)), id]
id V1 V2 V3 V4
1: 01 1 1 1 1
2: 02 2 1 1 1
3: 03 3 1 1 1
4: 04 4 1 2 2
Upvotes: 4
Reputation: 26218
A dplyr
approach
df <- structure(list(id = c("01", "02", "03", "04"), V1 = c(1, 2, 3,
4), V2 = c(1, 1, 1, 1), V3 = c(1, NA, NA, 2), V4 = c(NA, NA,
NA, NA)), class = "data.frame", row.names = c(NA, -4L))
library(dplyr, warn.conflicts = F)
df %>% mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[-1][!is.na(cur_data()[-1])],1))))
#> id V1 V2 V3 V4
#> 1 01 1 1 1 2
#> 2 02 2 1 2 2
#> 3 03 3 1 2 2
#> 4 04 4 1 2 2
If you'll group_by
on id
column, you won't have to use [-1] on
cur_data()`
df %>% group_by(id) %>%
mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[!is.na(cur_data())],1))))
Upvotes: 5
Reputation: 39858
One option using dplyr
could be:
df %>%
mutate(across(-id, ~ ifelse(is.na(.), coalesce(!!!select(., V4:V1)), .)))
id V1 V2 V3 V4
1 1 1 1 1 1
2 2 2 1 1 1
3 3 3 1 1 1
4 4 4 1 2 2
Upvotes: 8