Reputation: 153
gvkey
= company id
I have created a data frame called df
executive_id gvkey year
1 5623 1004 2004
2 5625 1004 2004
3 5626 1004 2004
4 24312 1004 2004
5 28074 1004 2004
6 5623 1004 2005
7 5625 1004 2005
8 5626 1004 2005
9 24312 1004 2005
10 28074 1004 2005
11 5623 1004 2006
12 5625 1004 2006
13 5626 1004 2006
14 24312 1004 2006
15 28074 1004 2006
16 5623 1004 2007
17 5625 1004 2007
18 5626 1004 2007
19 24312 1004 2007
20 35684 1004 2007
21 5623 1004 2008
22 5626 1004 2008
23 24312 1004 2008
24 35684 1004 2008
25 38212 1004 2008
26 5623 1004 2009
27 5626 1004 2009
28 24312 1004 2009
29 35684 1004 2009
30 38212 1004 2009
31 5623 1004 2010
32 5626 1004 2010
33 35684 1004 2010
34 38212 1004 2010
35 44350 1004 2010
36 5623 1004 2011
37 5626 1004 2011
38 35684 1004 2011
39 38212 1004 2011
40 44350 1004 2011
41 8741 1013 2004
42 24949 1013 2004
43 26620 1013 2004
44 26621 1013 2004
45 28315 1013 2004
46 28316 1013 2004
47 28317 1013 2004
48 32048 1013 2004
49 32049 1013 2004
50 8741 1013 2005
51 24949 1013 2005
52 26620 1013 2005
53 26621 1013 2005
54 28315 1013 2005
55 32048 1013 2005
56 32049 1013 2005
57 8741 1013 2006
58 26621 1013 2006
59 28315 1013 2006
60 32048 1013 2006
61 32049 1013 2006
62 8741 1013 2007
63 26621 1013 2007
64 28315 1013 2007
65 32048 1013 2007
66 32049 1013 2007
67 33885 1013 2007
68 33886 1013 2007
69 8741 1013 2008
70 26621 1013 2008
71 32048 1013 2008
72 32049 1013 2008
73 33885 1013 2008
74 33886 1013 2008
75 43043 1013 2008
76 8741 1013 2009
77 26621 1013 2009
78 32048 1013 2009
79 32049 1013 2009
80 33885 1013 2009
81 33886 1013 2009
82 43043 1013 2009
83 8741 1013 2010
84 26621 1013 2010
85 32049 1013 2010
86 33885 1013 2010
87 43043 1013 2010
...
This is how I want to delete certain columns of df
based on the following logic:
spelled logic:
For each year
, delete all executive_id
except of the first exectuive_id
entry for each year
Final df
:
executive_id gvkey year
1 5623 1004 2004
6 5623 1004 2005
11 5623 1004 2006
16 5623 1004 2007
21 5623 1004 2008
26 5623 1004 2009
31 5623 1004 2010
36 5623 1004 2011
41 8741 1013 2004
50 8741 1013 2005
57 8741 1013 2006
62 8741 1013 2007
69 8741 1013 2008
76 8741 1013 2009
83 8741 1013 2010
...
Additional question:
How must the code look like, if I have two additional columns whic hare no duplciates?
year gvkey executive_id compensation compensation_change
1 2004 1004 5623 1287. 11.2
2 2004 1004 5625 417. 7.47
3 2004 1004 5626 550. 4.48
4 2004 1004 24312 499. 65.9
5 2004 1004 28074 549. 5.53
6 2005 1004 5623 1758. 36.5
7 2005 1004 5625 457. 9.46
8 2005 1004 5626 666. 21.1
9 2005 1004 24312 711. 42.4
10 2005 1004 28074 581. 5.71
...
Thank you so much!!!
Upvotes: 0
Views: 38
Reputation: 102920
I think what you want might be something like below, using duplicated()
:
df[!duplicated(df[,c("year","gvkey")]),]
such that
> df[!duplicated(df[-1]),]
executive_id gvkey year
1 5623 1004 2004
6 5623 1004 2005
11 5623 1004 2006
16 5623 1004 2007
21 5623 1004 2008
26 5623 1004 2009
31 5623 1004 2010
36 5623 1004 2011
41 8741 1013 2004
50 8741 1013 2005
57 8741 1013 2006
62 8741 1013 2007
69 8741 1013 2008
76 8741 1013 2009
83 8741 1013 2010
Upvotes: 2