Duck

Reputation: 39595

Fill missing values with previous values by row using dplyr

I am working with a data frame in R which has some missing values across rows. The data frame looks like this (dput output is at the end):

df
  id V1 V2 V3 V4
1 01  1  1  1 NA
2 02  2  1 NA NA
3 03  3  1 NA NA
4 04  4  1  2 NA

Each row is a different id. As you can see, the rows have missing values. How can I get a data frame completed in the following style, without reshaping to long format or pivoting, as my real data is very large:

df
  id V1 V2 V3 V4
1 01  1  1  1  1
2 02  2  1  1  1
3 03  3  1  1  1
4 04  4  1  2  2

I was trying to use fill from tidyr, but I am having issues at the row level. I have seen posts where it is used together with dplyr's across, but I cannot find them again. I have also tried group_by(id) and rowwise without success (see the sketch below). Note that only the variables/columns starting with V should be filled with previous values.
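For reference, this is roughly the kind of attempt described and why it fails: fill() carries values down within columns, not across rows, so with one row per id group there is nothing to carry forward (a sketch, assuming fill was tried per group):

library(dplyr)
library(tidyr)

# fill() works down each column; each id group is a single row,
# so the row-wise NAs are left untouched
df %>%
  group_by(id) %>%
  fill(V1:V4, .direction = "down") %>%
  ungroup()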

Here is the data:

#Data
df <- structure(list(id = c("01", "02", "03", "04"), V1 = c(1, 2, 3, 
4), V2 = c(1, 1, 1, 1), V3 = c(1, NA, NA, 2), V4 = c(NA, NA, 
NA, NA)), class = "data.frame", row.names = c(NA, -4L))

Many thanks for your time.

Upvotes: 9

Views: 3727

Answers (5)

Anoushiravan R

Reputation: 21908

One solution could be using the na.locf function from the zoo package combined with purrr::pmap in a row-wise operation. na.locf takes the most recent non-NA value and replaces all subsequent NA values with it. As a reminder, c(...) in both solutions captures all values of V1:V4 in each row on every iteration; I excluded the id column (the [-1]) in both, as it is not involved in the calculations.

library(dplyr)
library(zoo)
library(purrr)

df %>%
  mutate(pmap_df(., ~ na.locf(c(...)[-1])))

  id V1 V2 V3 V4
1 01  1  1  1  1
2 02  2  1  1  1
3 03  3  1  1  1
4 04  4  1  2  2

Or we can use the coalesce function from dplyr, replacing every NA value in each row with the row's last non-NA value, as we did earlier with na.locf. However, this solution is a bit verbose:

df %>%
  mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))]; 
  coalesce(c(...), x[length(x)])}))

  id V1 V2 V3 V4
1 01  1  1  1  1
2 02  2  1  1  1
3 03  3  1  1  1
4 04  4  1  2  2

You could also use this:

library(purrr)

df %>%
  mutate(across(!id, ~ replace(., is.na(.), invoke(coalesce, rev(df[-1])))))

  id V1 V2 V3 V4
1 01  1  1  1  1
2 02  2  1  1  1
3 03  3  1  1  1
4 04  4  1  2  2

The warning message can be ignored. It is produced because there are 6 NA values to fill, but applying dplyr::coalesce across the reversed columns yields one element per row, i.e. a vector of 4 elements to fill 6 slots, so R recycles the replacement vector.
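A minimal illustration of that recycling, with made-up values:

x <- c(NA, 1, NA, NA)

# three NA slots filled from a length-two vector: R recycles the
# replacement and emits the same kind of warning
replace(x, is.na(x), c(9, 8))
#> Warning: number of items to replace is not a multiple of replacement length
#> [1] 9 1 8 9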

Upvotes: 10

G. Grothendieck

Reputation: 269586

If the reason you want to avoid reshaping is to save runtime, that idea is actually mistaken, assuming the benchmark below continues to hold at scale. Note that f, which transposes, applies na.locf and then transposes back, is the fastest.

library(microbenchmark)
library(data.table)
library(dplyr)
library(purrr)
library(zoo)

microbenchmark(times = 10,
  a = df %>% mutate(pmap_df(., ~ na.locf(c(...)[-1]))),
  b = df %>%
    mutate(pmap_df(., ~ {x <- c(...)[!is.na(c(...))]; 
    coalesce(c(...), x[length(x)])})),
  c = df %>%
    mutate(across(-id, ~ ifelse(is.na(.), coalesce(!!!select(., V4:V1)), .))),
  d = df %>% mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[-1][!is.na(cur_data()[-1])],1)))),
  e = as.data.table(df)[, setNames(as.list(nafill(unlist(.SD), type = "locf")), names(.SD)), id],
  f = data.frame(id = df$id, t(na.locf(t(df[-1])))))

giving:

Unit: milliseconds
 expr       min        lq      mean    median        uq       max neval
    a 11.343302 12.934702 15.032001 13.115151 14.799400 30.135901    10
    b 11.641301 13.116401 14.030551 14.426751 15.012701 15.517501    10
    c 28.201501 30.470801 33.375761 32.672950 36.671101 40.448701    10
    d 25.394901 26.648801 30.044331 27.971251 32.433801 39.570600    10
    e  3.750801  4.023700  8.771401  4.150701  4.367502 50.636700    10
    f  2.454701  2.458201  3.009181  2.603951  2.952302  6.126101    10
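f wins because t() turns each original row into a column of a numeric matrix, so one vectorized na.locf pass fills every row at once, with no per-row iteration. As a standalone sketch (this assumes the filled columns are all numeric, as V1:V4 are here, since t() coerces to a matrix):

library(zoo)

# rows become columns, na.locf fills down each column, transpose back
data.frame(id = df$id, t(na.locf(t(df[-1]))))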

Upvotes: 5

ThomasIsCoding

Reputation: 101335

A data.table option with nafill: grouping by id makes .SD a single row, unlist(.SD) flattens it into a vector, nafill(type = "locf") carries the last observation forward, and as.list with setNames rebuilds the columns.

> library(data.table)
> setDT(df)[, setNames(as.list(nafill(unlist(.SD), type = "locf")), names(.SD)), id]
   id V1 V2 V3 V4
1: 01  1  1  1  1
2: 02  2  1  1  1
3: 03  3  1  1  1
4: 04  4  1  2  2

Upvotes: 4

AnilGoyal

Reputation: 26218

A dplyr approach

df <- structure(list(id = c("01", "02", "03", "04"), V1 = c(1, 2, 3, 4),
                     V2 = c(1, 1, 1, 1), V3 = c(1, NA, NA, 2),
                     V4 = c(NA, NA, NA, NA)),
                class = "data.frame", row.names = c(NA, -4L))


library(dplyr, warn.conflicts = F)

df %>% mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[-1][!is.na(cur_data()[-1])],1))))
#>   id V1 V2 V3 V4
#> 1 01  1  1  1  2
#> 2 02  2  1  2  2
#> 3 03  3  1  2  2
#> 4 04  4  1  2  2

Note that without any grouping, cur_data() refers to the whole data frame, so the fill value above is the last non-NA value overall (2), not the last value of each row. If you group_by the id column, cur_data() is just the current row, you get the desired row-wise fill, and you won't have to use [-1] on cur_data():

df %>% group_by(id) %>%
  mutate(across(V1:V4, ~ coalesce(., tail(cur_data()[!is.na(cur_data())],1))))

Upvotes: 5

tmfmnk

Reputation: 39858

One option using dplyr could be the following. coalesce(!!!select(., V4:V1)) splices the columns in reverse order into coalesce(), which returns, for each row, the first non-NA value scanning from V4 back to V1, i.e. the row's last observed value; ifelse() then substitutes that value wherever a column is NA:

df %>%
 mutate(across(-id, ~ ifelse(is.na(.), coalesce(!!!select(., V4:V1)), .)))

  id V1 V2 V3 V4
1 01  1  1  1  1
2 02  2  1  1  1
3 03  3  1  1  1
4 04  4  1  2  2

Upvotes: 8
