Reputation: 519
I have data which looks like this (but for 14 million people)
It is each person's value for MB in a given quarter (qtr). If qtr == 20091001, it is the period from October to December 2009.
df <- data.frame(
ID = rep(c('ABC1234', 'CED6723', 'GHB9876', "MNR4444", "FRE9823"), 4),
qtr = c(rep('20090101', 5), rep('20090401', 5),rep('20090701', 5),rep('20091001', 5)),
MB = c('0000000', "1234567", "5678910", "1234567",
"9384756", "3456789", NA, '0000000',
"7394857", '0000000', '0000000', '0000000',
"9485967", "9485967", "9485967", '0000000',
'0000000', NA, '4545455', '1987656'
)
)
What I want to do is replace the value of MB where it is NA or == '0000000' with the most recent value of MB.
The rules would be: If MB is NA or 0 in 20091001, I want the value to come from 20090701 (if populated), then 20090401 (if populated), then 20090101.
If MB is NA or 0 in 20090701, I want the value to come from 20090401 (if populated), then 20090101 (if populated), then 20091001.
If MB is NA or 0 in 20090401, I want the value to come from 20090101 (if populated), then 20090701 (if populated), then 20091001.
If MB is NA or 0 in 20090101, I want the value to come from 20090401 (if populated), then 20090701 (if populated), then 20091001.
My ideal output looks like this
df2 <- data.frame(
ID = rep(c('ABC1234', 'CED6723', 'GHB9876', "MNR4444", "FRE9823"), 4),
qtr = c(rep('20090101', 5), rep('20090401', 5),rep('20090701', 5),rep('20091001', 5)),
MB = c("1234567", "1234567", "5678910", "1234567",
"9384756", "3456789", "3456789", "3456789",
"7394857", "7394857", "7394857", "7394857",
"9485967", "9485967", "9485967", "9485967",
'4545455', '4545455', '4545455', '1987656'
)
)
I have seen this response r - copy missing values from other variables This is for data which is in wide format. Do I need to convert my data to wide format, and change the labels of the qtr variable to 1, 2, 3, 4 ? Or is there a way to do this on my dataframe as it is?
Thanks.
Upvotes: 0
Views: 49
Reputation: 389055
We can use zoo::na.locf
assuming the data is ordered by qtr
as shown in the shared data.
df$MB <- as.numeric(as.character(df$MB))
df$MB <- replace(df$MB, df$MB == 0, NA)
df$MB <- zoo::na.locf(df$MB, fromLast = TRUE)
df
# ID qtr MB
#1 ABC1234 20090101 1234567
#2 CED6723 20090101 1234567
#3 GHB9876 20090101 5678910
#4 MNR4444 20090101 1234567
#5 FRE9823 20090101 9384756
#6 ABC1234 20090401 3456789
#7 CED6723 20090401 7394857
#8 GHB9876 20090401 7394857
#9 MNR4444 20090401 7394857
#10 FRE9823 20090401 9485967
#11 ABC1234 20090701 9485967
#12 CED6723 20090701 9485967
#13 GHB9876 20090701 9485967
#14 MNR4444 20090701 9485967
#15 FRE9823 20090701 9485967
#16 ABC1234 20091001 4545455
#17 CED6723 20091001 4545455
#18 GHB9876 20091001 4545455
#19 MNR4444 20091001 4545455
#20 FRE9823 20091001 1987656
Upvotes: 0
Reputation: 1261
So you can do that by using tidyr function fill, but first I checked if the value is zeros then I made it NA because fill focuses on NA values only as follows;
library(dplyr)
library(tidyr)
# Creating dataframe
df <- data.frame(
ID = rep(c('ABC1234', 'CED6723', 'GHB9876', "MNR4444", "FRE9823"), 4),
qtr = c(rep('20090101', 5), rep('20090401', 5),rep('20090701', 5),rep('20091001', 5)),
MB = c('0000000', "1234567", "5678910", "1234567",
"9384756", "3456789", NA, '0000000',
"7394857", '0000000', '0000000', '0000000',
"9485967", "9485967", "9485967", '0000000',
'0000000', NA, '4545455', '1987656'),
stringsAsFactors = FALSE
)
df2 <-
df %>%
# If MB is zeros then convert it into NA
mutate(MB = ifelse(as.numeric(MB) == 0, NA_character_, MB)) %>%
# Fill automatically fills NA data with closest values
# given the direction of the filling
# you can change the direction as you like
# To know more about it ?fill
fill(MB, .direction = "downup")
df2
# ID qtr MB
# ABC1234 20090101 1234567
# CED6723 20090101 1234567
# GHB9876 20090101 5678910
# MNR4444 20090101 1234567
# FRE9823 20090101 9384756
# ABC1234 20090401 3456789
# CED6723 20090401 3456789
# GHB9876 20090401 3456789
# MNR4444 20090401 7394857
# FRE9823 20090401 7394857
Upvotes: 1