Reputation: 183
I am just going to give it a try as I know that here are some smart people who might have an r -code for this. I wont be able to code this by myself.
So I got a dataset that contains the names and years-months between 2000-01 and 2008-12. Looking like this:
Name Date
A 2000-01
A 2000-02
A ...
A 2008-12
A 2000-01
B 2000-01
B ...
B 2008-12
C and so on..
It can happen that for each name in my key column there is one value for each year. Thats the best I can ask for. Unfortunately some years dont have a value in my key column. Getting further in my dataset looking only at Name A:
So if I do not have 1 observations for every year between 2000-2008 and I want to get the row from the year and month that does not have a value for my key column based on the month from the year on the next observation. In this example:
2003-02 has a value for my keycolumn and 2002-02 does not, I want to get back the row from the date 2002-02 and Name A. In a nutshell: "Keeping rows from the previous year based on key column from the next year"
Is there some easy way to code this? Thank you :)
Upvotes: 0
Views: 47
Reputation: 13128
There's no straightforward and easy way to code what you're describing, but it's certainly possible to break the problem down into easier parts. The core part of the problem is as follows. Given a dataframe of rows with non-NA values, e.g.
year month
1 2002 12
2 2005 11
3 2006 01
4 2008 07
for each row, check the dataframe to see if the previous year exists; if yes, return the row, if no, return an additional row with the previous year and the same month. Here's what a function to do that might look like
check_ym <- function(y, m, dat) {
if ((y - 1) %in% dat$year) {
return(data.frame(Date = paste(y, m, sep = "-"), stringsAsFactors = FALSE))
} else {
return(data.frame(Date = paste(c(y - 1, y), c(m, m), sep = "-"), stringsAsFactors = FALSE))
}
}
Now, let's make some fake data.
library(dplyr)
library(tidyr)
library(purrr)
# Simulate data
set.seed(123)
x <- data.frame(Date = paste(sample(2000:2008, 4),
sprintf("%02d", sample(1:12, 4, replace = TRUE)),
sep = "-"),
KeyColumn = floor(runif(4, 1, 10)))
d <- data.frame(Date = paste(rep(2000:2008, each = 12),
sprintf("%02d", rep(1:12, times = 9)),
sep = "-")) %>%
left_join(x)
Identify the non-NA rows:
dd <- d %>%
na.omit() %>%
separate(Date, into = c("year", "month")) %>%
mutate(year = as.numeric(year))
dd
# year month KeyColumn
# 1 2002 12 5
# 2 2005 11 5
# 3 2006 01 5
# 4 2008 07 9
Then, we run the function above, iterating through the year
and month
columns. This gives us
out <- map2_df(dd$year, dd$month, .f = check_ym, dat = dd)
out
# Date
# 1 2001-12
# 2 2002-12
# 3 2004-11
# 4 2005-11
# 5 2006-01
# 6 2007-07
# 7 2008-07
Finally, we join this with our original data:
inner_join(out, d)
# Joining, by = "Date"
# Date KeyColumn
# 1 2001-12 NA
# 2 2002-12 5
# 3 2004-11 NA
# 4 2005-11 5
# 5 2006-01 5
# 6 2007-07 NA
# 7 2008-07 9
This is just for one Name
. We can also do this for many Name
s. First create some fake data:
# Simulate data
set.seed(123)
d <- map_df(setNames(1:3, LETTERS[1:3]), function(...) {
x <- data.frame(Date = paste(sample(2000:2008, 4),
sprintf("%02d", sample(1:12, 4, replace = TRUE)),
sep = "-"),
KeyColumn = floor(runif(4, 1, 10)))
data.frame(Date = paste(rep(2000:2008, each = 12),
sprintf("%02d", rep(1:12, times = 9)),
sep = "-")) %>%
left_join(x)
}, .id = "Name")
dd <- d %>%
na.omit() %>%
separate(Date, into = c("year", "month")) %>%
mutate(year = as.numeric(year))
dd
# Name year month KeyColumn
# 1 A 2002 12 5
# 2 A 2005 11 5
# 3 A 2006 01 5
# 4 A 2008 07 9
# 5 B 2000 04 6
# 6 B 2004 01 7
# 7 B 2005 12 9
# 8 B 2006 03 9
# 9 B 2000 04 6
# 10 C 2003 12 1
# 11 C 2005 04 7
# 12 C 2006 11 5
# 13 C 2008 02 8
Now, use split
to split the dataframe into three dataframes by Name
; for each sub-dataframe, we apply check_ym()
, and then we combine the results together and join it with the original data:
lapply(split(dd, dd$Name), function(dat) {
map2_df(dat$year, dat$month, .f = check_ym, dat = dat)
}) %>%
bind_rows(.id = "Name") %>%
inner_join(d)
# Joining, by = c("Name", "Date")
# Name Date KeyColumn
# 1 A 2001-12 NA
# 2 A 2002-12 5
# 3 A 2004-11 NA
# 4 A 2005-11 5
# 5 A 2006-01 5
# 6 A 2007-07 NA
# 7 A 2008-07 9
# 8 B 2000-04 6
# 9 B 2003-01 NA
# 10 B 2004-01 7
# 11 B 2005-12 9
# 12 B 2006-03 9
# 13 C 2002-12 NA
# 14 C 2003-12 1
# 15 C 2004-04 NA
# 16 C 2005-04 7
# 17 C 2006-11 5
# 18 C 2007-02 NA
# 19 C 2008-02 8
Upvotes: 1