Nettle
Nettle

Reputation: 3321

Identify interrupted observations

I would like to identify missing observations that suggest cleaning/data errors.

My dataframe consists of many accounts over many years. Here are the rules it follows:

In the data below, Accounts A-E show an amount for years 2001-2004.

df <- tribble(
       ~account,    ~"2001",     ~"2002",    ~"2003",  ~"2004",
       "Account.A",     100,          90,         87,       80,  #<Good   
       "Account.B",       0,          20,         30,       33,  #<Good
       "Account.C",      50,          55,          0,       0,   #<Good
       "Account.D",     200,         210,         NA,       210, #<Bad
       "Account.E",     150,           0,        212,       211) #<Bad

Account A,B,C show good data:

Account D and E show bad data:

My goal is to identify interrupted lines (D,E) and tag them.

I would like a solution that could be generalized across many years and thousands of accounts.

Upvotes: 1

Views: 55

Answers (1)

sbha
sbha

Reputation: 10422

Here is a tidyverse option that might not be the prettiest, but should do the trick:

library(tidyverse)
df %>% 
  gather(year, value, `2001`:`2004`) %>% 
  group_by(account) %>% 
  mutate(order = if_else(year == min(year), 'first', 
                         if_else(year == max(year), 'last', 'mid'))) %>% 
  mutate(value = replace(value, is.na(value), 0)) %>% 
  mutate(start0 = row_number() >= min(row_number()[value != 0]),
         end0 = row_number() <= max(row_number()[value != 0])) %>% 
  mutate(check = if_else(order == 'mid' & value == 0 & start0 == TRUE & end0 == TRUE, TRUE, FALSE)) %>% 
  filter(check == TRUE)

# A tibble: 2 x 7
# Groups:   account [2]
  account   year  value order start0 end0  check
  <chr>     <chr> <dbl> <chr> <lgl>  <lgl> <lgl>
1 Account.E 2002      0 mid   TRUE   TRUE  TRUE 
2 Account.D 2003      0 mid   TRUE   TRUE  TRUE 

Here's an explanation:

  • Convert the data from wide to long.
  • Determine, by group, whether an account entry is a first, middle, or last entry in its history.
  • Because zeroes and NAs are treated the same, NAs are replaced with zeros to make this a little easier to work with, but they could be kept as is and the code updated to handle them.
  • TRUE/FALSE columns are added for whether a 0 value sequence runs from either the start or the end of the account history.
  • If an account is 0, not a first or last entry, and not the part of a 0 sequence run from the beginning or end of the account history, the account is marked TRUE to check.
  • Finally, there is a filter to only the accounts that need to be checked.

Upvotes: 1

Related Questions