Reputation: 23
I'm reposting this with more information on what I want. I'm fairly new in R. I have a database (panel) and I want to delete some observations based on certain values. Let's take the next panel as an example (derived from plm packages):
Panel <-read.dta("http://dss.princeton.edu/training/Panel101.dta")
> head(Panel)
country year y y_bin x1 x2 x3 opinion op
1 A 1990 1342787840 1 0.2779036 -1.1079559 0.28255358 Str agree 1
2 A 1991 -1899660544 0 0.3206847 -0.9487200 0.49253848 Disag 0
3 A 1992 -11234363 0 0.3634657 -0.7894840 0.70252335 Disag 0
4 A 1993 2645775360 1 0.2461440 -0.8855330 -0.09439092 Disag 0
5 A 1994 3008334848 1 0.4246230 -0.7297683 0.94613063 Disag 0
6 A 1995 3229574144 1 0.4772141 -0.7232460 1.02968037 Str agree 1
I want to delete the observations for the next year following OP=1. For instance if in 1990, OP =1, I want to exclude country in 1991, 1992, 1992, etc (all the next years of the database). If OP =1 in 1996, I want to exclude country in 1997, 1998 and 1999. I want also to keep observations that have OP=0 all the time. I'm running a logit model. Therefore, I'm "comparing" OP=1 and OP=0.
PS : The dataframe may be not be a good example but in my dataframe, OP = 1 only once.
Does anyone know how I can do that?
Thanks in advance.
Upvotes: 0
Views: 958
Reputation: 78927
This at least could solve problem Nr. 1. As this dataframe has only year with lag of one. And what do you mean with: keep observations that have OP=0 all the time?
df %>%
group_by(country) %>%
arrange(country, year) %>%
mutate(lag_year = year - lag(year)) %>%
filter(op == 1 & lag_year==1) %>%
select(-year)
Output:
country year y y_bin x1 x2 x3 opinion op
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 A 1995 3229574144 1 0.477 -0.723 1.03 Str agree 1
2 A 1997 2771810560 1 0.0516 -0.705 1.42 Str agree 1
3 B 1991 -711623744 0 0.106 1.65 0.260 Str agree 1
4 B 1992 -1933116160 0 0.354 1.59 -0.234 Agree 1
5 B 1996 577199360 1 0.820 1.53 -0.500 Str agree 1
6 B 1998 -149072048 0 0.705 1.42 -0.448 Agree 1
7 C 1991 -3415966464 0 1.18 -1.34 0.284 Str agree 1
8 C 1992 -355804672 0 1.26 -1.26 0.373 Agree 1
9 C 1995 1959696640 1 1.16 -1.22 0.695 Agree 1
10 C 1996 530576672 1 1.16 -1.24 0.817 Agree 1
# … with 21 more rows
Upvotes: 1
Reputation: 1700
Given this sample data:
df <- structure(list(country = c("a", "a", "a", "a", "b", "b", "b"),
year = c(91, 92, 93, 94, 91, 92, 93), op = c(1, 0, 0, 1,
0, 0, 0)), row.names = c(NA, -7L), class = c("tbl_df", "tbl",
"data.frame"))
# A tibble: 7 x 3
country year op
<chr> <dbl> <dbl>
1 a 91 1
2 a 92 0
3 a 93 0
4 a 94 1
5 b 91 0
6 b 92 0
7 b 93 0
You can subset it by first creating a separate column has_1
that tells you whether there is op == 1
in a given country in any year. Then you can filter for (op == 1 & has_1 == 1) | (op == 0 & has_1 == 0)
.
df %>% group_by(country) %>%
mutate(has_1 = max(op)) %>%
filter(
(op == 1 & has_1 ==1) | (op ==0 & has_1 == 0)
)
Which results in:
# A tibble: 5 x 4
# Groups: country [2]
country year op has_1
<chr> <dbl> <dbl> <dbl>
1 a 91 1 1
2 a 94 1 1
3 b 91 0 0
4 b 92 0 0
5 b 93 0 0
Upvotes: 1