Thierno Bocar Diop
Thierno Bocar Diop

Reputation: 23

Delete observations base on specific value of variable in R

I'm reposting this with more information on what I want. I'm fairly new in R. I have a database (panel) and I want to delete some observations based on certain values. Let's take the next panel as an example (derived from plm packages):

Panel <-read.dta("http://dss.princeton.edu/training/Panel101.dta")
> head(Panel)
  country year           y y_bin        x1         x2          x3   opinion op
1       A 1990  1342787840     1 0.2779036 -1.1079559  0.28255358 Str agree  1
2       A 1991 -1899660544     0 0.3206847 -0.9487200  0.49253848     Disag  0
3       A 1992   -11234363     0 0.3634657 -0.7894840  0.70252335     Disag  0
4       A 1993  2645775360     1 0.2461440 -0.8855330 -0.09439092     Disag  0
5       A 1994  3008334848     1 0.4246230 -0.7297683  0.94613063     Disag  0
6       A 1995  3229574144     1 0.4772141 -0.7232460  1.02968037 Str agree  1

I want to delete the observations for the next year following OP=1. For instance if in 1990, OP =1, I want to exclude country in 1991, 1992, 1992, etc (all the next years of the database). If OP =1 in 1996, I want to exclude country in 1997, 1998 and 1999. I want also to keep observations that have OP=0 all the time. I'm running a logit model. Therefore, I'm "comparing" OP=1 and OP=0.

PS : The dataframe may be not be a good example but in my dataframe, OP = 1 only once.

Does anyone know how I can do that?

Thanks in advance.

Upvotes: 0

Views: 958

Answers (2)

TarJae
TarJae

Reputation: 78927

This at least could solve problem Nr. 1. As this dataframe has only year with lag of one. And what do you mean with: keep observations that have OP=0 all the time?

df %>% 
  group_by(country) %>% 
  arrange(country, year) %>% 
  mutate(lag_year = year - lag(year)) %>% 
  filter(op == 1 & lag_year==1) %>% 
  select(-year)

Output:

   country  year           y y_bin     x1     x2     x3 opinion      op
   <fct>   <int>       <dbl> <dbl>  <dbl>  <dbl>  <dbl> <fct>     <dbl>
 1 A        1995  3229574144     1 0.477  -0.723  1.03  Str agree     1
 2 A        1997  2771810560     1 0.0516 -0.705  1.42  Str agree     1
 3 B        1991  -711623744     0 0.106   1.65   0.260 Str agree     1
 4 B        1992 -1933116160     0 0.354   1.59  -0.234 Agree         1
 5 B        1996   577199360     1 0.820   1.53  -0.500 Str agree     1
 6 B        1998  -149072048     0 0.705   1.42  -0.448 Agree         1
 7 C        1991 -3415966464     0 1.18   -1.34   0.284 Str agree     1
 8 C        1992  -355804672     0 1.26   -1.26   0.373 Agree         1
 9 C        1995  1959696640     1 1.16   -1.22   0.695 Agree         1
10 C        1996   530576672     1 1.16   -1.24   0.817 Agree         1
# … with 21 more rows

Upvotes: 1

MKR
MKR

Reputation: 1700

Given this sample data:

df <- structure(list(country = c("a", "a", "a", "a", "b", "b", "b"), 
    year = c(91, 92, 93, 94, 91, 92, 93), op = c(1, 0, 0, 1, 
    0, 0, 0)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", 
"data.frame"))
# A tibble: 7 x 3
  country  year    op
  <chr>   <dbl> <dbl>
1 a          91     1
2 a          92     0
3 a          93     0
4 a          94     1
5 b          91     0
6 b          92     0
7 b          93     0

You can subset it by first creating a separate column has_1 that tells you whether there is op == 1 in a given country in any year. Then you can filter for (op == 1 & has_1 == 1) | (op == 0 & has_1 == 0).

df %>% group_by(country) %>% 
  mutate(has_1 = max(op)) %>% 
  filter(
    (op == 1 & has_1 ==1) | (op ==0 & has_1 == 0)
  )

Which results in:

# A tibble: 5 x 4
# Groups:   country [2]
  country  year    op has_1
  <chr>   <dbl> <dbl> <dbl>
1 a          91     1     1
2 a          94     1     1
3 b          91     0     0
4 b          92     0     0
5 b          93     0     0

Upvotes: 1

Related Questions