Reputation: 3321
I would like to identify missing observations that suggest cleaning/data errors.
My dataframe consists of many accounts over many years. Here are the rules it follows:
In the data below, Accounts A-E show an amount for years 2001-2004.
df <- tribble(
~account, ~"2001", ~"2002", ~"2003", ~"2004",
"Account.A", 100, 90, 87, 80, #<Good
"Account.B", 0, 20, 30, 33, #<Good
"Account.C", 50, 55, 0, 0, #<Good
"Account.D", 200, 210, NA, 210, #<Bad
"Account.E", 150, 0, 212, 211) #<Bad
Account A,B,C show good data:
Account D and E show bad data:
My goal is to identify interrupted lines (D,E) and tag them.
I would like a solution that could be generalized across many years and thousands of accounts.
Upvotes: 1
Views: 55
Reputation: 10422
Here is a tidyverse
option that might not be the prettiest, but should do the trick:
library(tidyverse)
df %>%
gather(year, value, `2001`:`2004`) %>%
group_by(account) %>%
mutate(order = if_else(year == min(year), 'first',
if_else(year == max(year), 'last', 'mid'))) %>%
mutate(value = replace(value, is.na(value), 0)) %>%
mutate(start0 = row_number() >= min(row_number()[value != 0]),
end0 = row_number() <= max(row_number()[value != 0])) %>%
mutate(check = if_else(order == 'mid' & value == 0 & start0 == TRUE & end0 == TRUE, TRUE, FALSE)) %>%
filter(check == TRUE)
# A tibble: 2 x 7
# Groups: account [2]
account year value order start0 end0 check
<chr> <chr> <dbl> <chr> <lgl> <lgl> <lgl>
1 Account.E 2002 0 mid TRUE TRUE TRUE
2 Account.D 2003 0 mid TRUE TRUE TRUE
Here's an explanation:
Upvotes: 1