Reputation: 519

Populate the missing values from a variable with values from the closest point in time, in R

I have data which looks like this (but for 14 million people)

It is each person's value for MB in a given quarter (qtr). If qtr == 20091001, it is the period from October to December 2009.

df <- data.frame(
  ID = rep(c('ABC1234', 'CED6723', 'GHB9876', "MNR4444", "FRE9823"), 4),
  qtr = c(rep('20090101', 5), rep('20090401', 5),rep('20090701', 5),rep('20091001', 5)),
  MB = c('0000000', "1234567", "5678910", "1234567", 
         "9384756", "3456789", NA, '0000000',
         "7394857", '0000000', '0000000', '0000000',
         "9485967", "9485967", "9485967", '0000000',
         '0000000', NA, '4545455', '1987656'
         )
)

What I want to do is replace the value of MB where it is NA or == '0000000' with the most recent value of MB.

The rules would be: If MB is NA or 0 in 20091001, I want the value to come from 20090701 (if populated), then 20090401 (if populated), then 20090101.

If MB is NA or 0 in 20090701, I want the value to come from 20090401 (if populated), then 20090101 (if populated), then 20091001.

If MB is NA or 0 in 20090401, I want the value to come from 20090101 (if populated), then 20090701 (if populated), then 20091001.

If MB is NA or 0 in 20090101, I want the value to come from 20090401 (if populated), then 20090701 (if populated), then 20091001.

My ideal output looks like this

df2 <- data.frame(
  ID = rep(c('ABC1234', 'CED6723', 'GHB9876', "MNR4444", "FRE9823"), 4),
  qtr = c(rep('20090101', 5), rep('20090401', 5),rep('20090701', 5),rep('20091001', 5)),
  MB = c("1234567", "1234567", "5678910", "1234567", 
         "9384756", "3456789", "3456789", "3456789",
         "7394857", "7394857", "7394857", "7394857",
         "9485967", "9485967", "9485967", "9485967",
         '4545455', '4545455', '4545455', '1987656'
  )
)

I have seen this response r - copy missing values from other variables This is for data which is in wide format. Do I need to convert my data to wide format, and change the labels of the qtr variable to 1, 2, 3, 4 ? Or is there a way to do this on my dataframe as it is?

Thanks.

Upvotes: 0

Answers (2)

Ronak Shah

Reputation: 389055

We can use zoo::na.locf assuming the data is ordered by qtr as shown in the shared data.

df$MB <- as.numeric(as.character(df$MB))
df$MB <- replace(df$MB, df$MB == 0, NA)
df$MB <- zoo::na.locf(df$MB, fromLast = TRUE)

df
#        ID      qtr      MB
#1  ABC1234 20090101 1234567
#2  CED6723 20090101 1234567
#3  GHB9876 20090101 5678910
#4  MNR4444 20090101 1234567
#5  FRE9823 20090101 9384756
#6  ABC1234 20090401 3456789
#7  CED6723 20090401 7394857
#8  GHB9876 20090401 7394857
#9  MNR4444 20090401 7394857
#10 FRE9823 20090401 9485967
#11 ABC1234 20090701 9485967
#12 CED6723 20090701 9485967
#13 GHB9876 20090701 9485967
#14 MNR4444 20090701 9485967
#15 FRE9823 20090701 9485967
#16 ABC1234 20091001 4545455
#17 CED6723 20091001 4545455
#18 GHB9876 20091001 4545455
#19 MNR4444 20091001 4545455
#20 FRE9823 20091001 1987656

Upvotes: 0

Nareman Darwish

Reputation: 1261

So you can do that by using tidyr function fill, but first I checked if the value is zeros then I made it NA because fill focuses on NA values only as follows;

library(dplyr)
library(tidyr)

# Creating dataframe
df <- data.frame(
  ID = rep(c('ABC1234', 'CED6723', 'GHB9876', "MNR4444", "FRE9823"), 4),
  qtr = c(rep('20090101', 5), rep('20090401', 5),rep('20090701', 5),rep('20091001', 5)),
  MB = c('0000000', "1234567", "5678910", "1234567", 
         "9384756", "3456789", NA, '0000000',
         "7394857", '0000000', '0000000', '0000000',
         "9485967", "9485967", "9485967", '0000000',
         '0000000', NA, '4545455', '1987656'),
  stringsAsFactors = FALSE
)

df2 <-
  df %>%
  # If MB is zeros then convert it into NA
  mutate(MB = ifelse(as.numeric(MB) == 0, NA_character_, MB)) %>%
  # Fill automatically fills NA data with closest values
  # given the direction of the filling
  # you can change the direction as you like
  # To know more about it ?fill
  fill(MB, .direction = "downup")

df2

# ID      qtr      MB
# ABC1234 20090101 1234567
# CED6723 20090101 1234567
# GHB9876 20090101 5678910
# MNR4444 20090101 1234567
# FRE9823 20090101 9384756
# ABC1234 20090401 3456789
# CED6723 20090401 3456789
# GHB9876 20090401 3456789
# MNR4444 20090401 7394857
# FRE9823 20090401 7394857

Upvotes: 1

Populate the missing values from a variable with values from the closest point in time, in R

Answers (2)

Related Questions