Jack
Jack

Reputation: 857

How to filter out one variable conditioned on the change in itself and another variable?

I am trying to clean my age variable from data entry discrepancies in a panel data that follow individuals over time. Many respondents have a jump in their age from one observation to another because they have missed a few waves and then came back as we can see for the persons below with ID 1 and 2. However, the person with ID 3 had a jump in age that is not equal to the year that s/he was out of the panel.

Could someone please guide me on how to filter out respondents from my data that have unreasonable change in their age that is not equal to the number of years they were out of the panel but to other reasons such as data entry issues?

id  year    age
1   2005    50
1   2006    51
1   2010    55
2   2002    38
2   2005    41
2   2006    42
3   2006    30
3   2009    38
3   2010    39

structure(list(id = structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3), format.stata = "%9.0g"), 
    year = structure(c(2005, 2006, 2010, 2002, 2005, 2006, 2006, 
    2009, 2010), format.stata = "%9.0g"), age = structure(c(50, 
    51, 55, 38, 41, 42, 30, 38, 39), format.stata = "%9.0g")), row.names = c(NA, 
-9L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 1

Views: 87

Answers (2)

akrun
akrun

Reputation: 887691

We can use diff

library(dplyr)
df %>%
    group_by(id) %>% 
    filter(!all(diff(year) == diff(age)))

-output

# A tibble: 3 x 3
# Groups:   id [1]
#     id  year   age
#  <dbl> <dbl> <dbl>
#1     3  2006    30
#2     3  2009    38
#3     3  2010    39

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 389175

You can filter out the id's whose change in year and age is not in sync.

library(dplyr)

df %>%
  group_by(id) %>%
  filter(!all(year - min(year) == age - min(age))) -> unreasonable_data

unreasonable_data

#     id  year   age
#  <dbl> <dbl> <dbl>
#1     3  2006    30
#2     3  2009    38
#3     3  2010    39

The same logic can also be implemented using lag.

df %>%
  group_by(id) %>%
  filter(!all(year - lag(year) == age - lag(age))) -> unreasonable_data

Upvotes: 1

Related Questions