delete rows based on multiple columns and order of entries in R

Question

gvkey = company id

I have created a data frame called df

    executive_id gvkey year
1           5623  1004 2004
2           5625  1004 2004
3           5626  1004 2004
4          24312  1004 2004
5          28074  1004 2004
6           5623  1004 2005
7           5625  1004 2005
8           5626  1004 2005
9          24312  1004 2005
10         28074  1004 2005
11          5623  1004 2006
12          5625  1004 2006
13          5626  1004 2006
14         24312  1004 2006
15         28074  1004 2006
16          5623  1004 2007
17          5625  1004 2007
18          5626  1004 2007
19         24312  1004 2007
20         35684  1004 2007
21          5623  1004 2008
22          5626  1004 2008
23         24312  1004 2008
24         35684  1004 2008
25         38212  1004 2008
26          5623  1004 2009
27          5626  1004 2009
28         24312  1004 2009
29         35684  1004 2009
30         38212  1004 2009
31          5623  1004 2010
32          5626  1004 2010
33         35684  1004 2010
34         38212  1004 2010
35         44350  1004 2010
36          5623  1004 2011
37          5626  1004 2011
38         35684  1004 2011
39         38212  1004 2011
40         44350  1004 2011
41          8741  1013 2004
42         24949  1013 2004
43         26620  1013 2004
44         26621  1013 2004
45         28315  1013 2004
46         28316  1013 2004
47         28317  1013 2004
48         32048  1013 2004
49         32049  1013 2004
50          8741  1013 2005
51         24949  1013 2005
52         26620  1013 2005
53         26621  1013 2005
54         28315  1013 2005
55         32048  1013 2005
56         32049  1013 2005
57          8741  1013 2006
58         26621  1013 2006
59         28315  1013 2006
60         32048  1013 2006
61         32049  1013 2006
62          8741  1013 2007
63         26621  1013 2007
64         28315  1013 2007
65         32048  1013 2007
66         32049  1013 2007
67         33885  1013 2007
68         33886  1013 2007
69          8741  1013 2008
70         26621  1013 2008
71         32048  1013 2008
72         32049  1013 2008
73         33885  1013 2008
74         33886  1013 2008
75         43043  1013 2008
76          8741  1013 2009
77         26621  1013 2009
78         32048  1013 2009
79         32049  1013 2009
80         33885  1013 2009
81         33886  1013 2009
82         43043  1013 2009
83          8741  1013 2010
84         26621  1013 2010
85         32049  1013 2010
86         33885  1013 2010
87         43043  1013 2010
...

This is how I want to delete certain columns of df based on the following logic:

spelled logic:

For each year, delete all executive_id except of the first exectuive_id entry for each year

Final df:

    executive_id gvkey year
1           5623  1004 2004
6           5623  1004 2005
11          5623  1004 2006
16          5623  1004 2007
21          5623  1004 2008
26          5623  1004 2009
31          5623  1004 2010
36          5623  1004 2011
41          8741  1013 2004
50          8741  1013 2005
57          8741  1013 2006
62          8741  1013 2007
69          8741  1013 2008
76          8741  1013 2009
83          8741  1013 2010
...

Additional question:

How must the code look like, if I have two additional columns whic hare no duplciates?

    year gvkey executive_id compensation compensation_change
 1  2004  1004         5623        1287.               11.2 
 2  2004  1004         5625         417.                7.47
 3  2004  1004         5626         550.                4.48
 4  2004  1004        24312         499.               65.9 
 5  2004  1004        28074         549.                5.53
 6  2005  1004         5623        1758.               36.5 
 7  2005  1004         5625         457.                9.46
 8  2005  1004         5626         666.               21.1 
 9  2005  1004        24312         711.               42.4 
10  2005  1004        28074         581.                5.71
...

Thank you so much!!!

ThomasIsCoding · Accepted Answer

I think what you want might be something like below, using duplicated():

df[!duplicated(df[,c("year","gvkey")]),]

such that

> df[!duplicated(df[-1]),]
   executive_id gvkey year
1          5623  1004 2004
6          5623  1004 2005
11         5623  1004 2006
16         5623  1004 2007
21         5623  1004 2008
26         5623  1004 2009
31         5623  1004 2010
36         5623  1004 2011
41         8741  1013 2004
50         8741  1013 2005
57         8741  1013 2006
62         8741  1013 2007
69         8741  1013 2008
76         8741  1013 2009
83         8741  1013 2010

delete rows based on multiple columns and order of entries in R

Answers (1)

Related Questions