Reputation: 857
I am trying to clean my age variable from data entry discrepancies in a panel data that follow individuals over time. Many respondents have a jump in their age from one observation to another because they have missed a few waves and then came back as we can see for the persons below with ID 1 and 2. However, the person with ID 3 had a jump in age that is not equal to the year that s/he was out of the panel.
Could someone please guide me on how to filter out respondents from my data that have unreasonable change in their age that is not equal to the number of years they were out of the panel but to other reasons such as data entry issues?
id year age
1 2005 50
1 2006 51
1 2010 55
2 2002 38
2 2005 41
2 2006 42
3 2006 30
3 2009 38
3 2010 39
structure(list(id = structure(c(1, 1, 1, 2, 2, 2, 3, 3, 3), format.stata = "%9.0g"),
year = structure(c(2005, 2006, 2010, 2002, 2005, 2006, 2006,
2009, 2010), format.stata = "%9.0g"), age = structure(c(50,
51, 55, 38, 41, 42, 30, 38, 39), format.stata = "%9.0g")), row.names = c(NA,
-9L), class = c("tbl_df", "tbl", "data.frame"))
Upvotes: 1
Views: 87
Reputation: 887691
We can use diff
library(dplyr)
df %>%
group_by(id) %>%
filter(!all(diff(year) == diff(age)))
-output
# A tibble: 3 x 3
# Groups: id [1]
# id year age
# <dbl> <dbl> <dbl>
#1 3 2006 30
#2 3 2009 38
#3 3 2010 39
Upvotes: 1
Reputation: 389175
You can filter out the id
's whose change in year
and age
is not in sync.
library(dplyr)
df %>%
group_by(id) %>%
filter(!all(year - min(year) == age - min(age))) -> unreasonable_data
unreasonable_data
# id year age
# <dbl> <dbl> <dbl>
#1 3 2006 30
#2 3 2009 38
#3 3 2010 39
The same logic can also be implemented using lag
.
df %>%
group_by(id) %>%
filter(!all(year - lag(year) == age - lag(age))) -> unreasonable_data
Upvotes: 1