Identify interrupted observations

Question

I would like to identify missing observations that suggest cleaning/data errors.

My dataframe consists of many accounts over many years. Here are the rules it follows:

Accounts may be created or terminated. In these cases, the amount is either $0 or NA. Such observations are (probably) not the result of bad data.
Accounts interrupted by a NA or $0 are probably the result of bad data or cleaning errors.

In the data below, Accounts A-E show an amount for years 2001-2004.

df <- tribble(
       ~account,    ~"2001",     ~"2002",    ~"2003",  ~"2004",
       "Account.A",     100,          90,         87,       80,  #



Account A,B,C show good data:


Account A shows uninterrupted data
Account B shows an account that began in 2002.
Account C shows an account that expired in 2003 and remained $0 ever after.


Account D and E show bad data:


Account D shows an account interupted in 2003
Account E shows an account interrupted in 2002


My goal is to identify interrupted lines (D,E) and tag them. 

I would like a solution that could be generalized across many years and thousands of accounts.

sbha · Accepted Answer

Here is a tidyverse option that might not be the prettiest, but should do the trick:

library(tidyverse)
df %>% 
  gather(year, value, `2001`:`2004`) %>% 
  group_by(account) %>% 
  mutate(order = if_else(year == min(year), 'first', 
                         if_else(year == max(year), 'last', 'mid'))) %>% 
  mutate(value = replace(value, is.na(value), 0)) %>% 
  mutate(start0 = row_number() >= min(row_number()[value != 0]),
         end0 = row_number() <= max(row_number()[value != 0])) %>% 
  mutate(check = if_else(order == 'mid' & value == 0 & start0 == TRUE & end0 == TRUE, TRUE, FALSE)) %>% 
  filter(check == TRUE)

# A tibble: 2 x 7
# Groups:   account [2]
  account   year  value order start0 end0  check
             
1 Account.E 2002      0 mid   TRUE   TRUE  TRUE 
2 Account.D 2003      0 mid   TRUE   TRUE  TRUE

Here's an explanation:

Convert the data from wide to long.
Determine, by group, whether an account entry is a first, middle, or last entry in its history.
Because zeroes and NAs are treated the same, NAs are replaced with zeros to make this a little easier to work with, but they could be kept as is and the code updated to handle them.
TRUE/FALSE columns are added for whether a 0 value sequence runs from either the start or the end of the account history.
If an account is 0, not a first or last entry, and not the part of a 0 sequence run from the beginning or end of the account history, the account is marked TRUE to check.
Finally, there is a filter to only the accounts that need to be checked.

Identify interrupted observations

Answers (1)

Related Questions