Reputation: 391
I have a data set of company directors. For example, company X had three directors in 2005, so there are three observations for company X in 2005. Each director has a unique ID, and each company has a unique ID (its ISIN). I want to keep only those observations where this year's directors and the previous year's directors are exactly the same, i.e. the whole board is identical. If this year's board has one new member alongside two members from the previous year, I do not want those observations.
The data set looks like this for one company:
ISIN year DirectorName DirectorID
1 US9898171015 2006 Thomas (Tom) E Davin 2247441792
2 US9898171015 2006 Matthew (Matt) L Hyde 4842568996
3 US9898171015 2007 James (Jim) M Weber 3581636766
4 US9898171015 2007 Matthew (Matt) L Hyde 4842568996
5 US9898171015 2007 David (Dave) M DeMattei 759047198
6 US9898171015 2008 James (Jim) M Weber 3581636766
7 US9898171015 2008 Matthew (Matt) L Hyde 4842568996
8 US9898171015 2008 David (Dave) M DeMattei 759047198
9 US9898171015 2009 William (Bill) Milroy Barnum Jr 20462211719
10 US9898171015 2009 James (Jim) M Weber 3581636766
11 US9898171015 2009 Matthew (Matt) L Hyde 4842568996
12 US9898171015 2009 David (Dave) M DeMattei 759047198
13 US9898171015 2010 William (Bill) Milroy Barnum Jr 20462211719
14 US9898171015 2010 James (Jim) M Weber 3581636766
15 US9898171015 2010 Matthew (Matt) L Hyde 4842568996
16 US9898171015 2011 Sarah (Sally) Gaines McCoy 11434863691
17 US9898171015 2011 William (Bill) Milroy Barnum Jr 20462211719
18 US9898171015 2011 James (Jim) M Weber 3581636766
19 US9898171015 2011 Matthew (Matt) L Hyde 4842568996
20 US9898171015 2012 Sarah (Sally) Gaines McCoy 11434863691
21 US9898171015 2012 Ernest R Johnson 40425210975
22 US9898171015 2013 Sarah (Sally) Gaines McCoy 11434863691
23 US9898171015 2013 Ernest R Johnson 40425210975
24 US9898171015 2013 Travis D Smith 53006212569
25 US9898171015 2014 Sarah (Sally) Gaines McCoy 11434863691
26 US9898171015 2014 Ernest R Johnson 40425210975
27 US9898171015 2014 Travis D Smith 53006212569
28 US9898171015 2015 Kalen F Holmes 11051172801
29 US9898171015 2015 Sarah (Sally) Gaines McCoy 11434863691
30 US9898171015 2015 Ernest R Johnson 40425210975
31 US9898171015 2015 Travis D Smith 53006212569
32 US9898171015 2016 Sarah (Sally) Gaines McCoy 11434863691
33 US9898171015 2016 Ernest R Johnson 40425210975
34 US9898171015 2016 Travis D Smith 53006212569
35 US9898171015 2017 Sarah (Sally) Gaines McCoy 11434863691
36 US9898171015 2017 Scott Andrew Bailey 174000000000
37 US9898171015 2017 Ernest R Johnson 40425210975
38 US9898171015 2017 Travis D Smith 53006212569
I tried this code:
endo <- ac %>%
  group_by(ISIN) %>%
  filter(DirectorID == lag(DirectorID, 1))
After running it, I got the following result:
ISIN year DirectorName DirectorID
1 US9898171015 2007 Matthew (Matt) L Hyde 4842568996
2 US9898171015 2008 James (Jim) M Weber 3581636766
3 US9898171015 2008 Matthew (Matt) L Hyde 4842568996
4 US9898171015 2008 David (Dave) M DeMattei 759047198
5 US9898171015 2009 James (Jim) M Weber 3581636766
6 US9898171015 2009 Matthew (Matt) L Hyde 4842568996
7 US9898171015 2009 David (Dave) M DeMattei 759047198
8 US9898171015 2010 William (Bill) Milroy Barnum Jr 20462211719
9 US9898171015 2010 James (Jim) M Weber 3581636766
10 US9898171015 2010 Matthew (Matt) L Hyde 4842568996
11 US9898171015 2011 William (Bill) Milroy Barnum Jr 20462211719
12 US9898171015 2011 James (Jim) M Weber 3581636766
13 US9898171015 2011 Matthew (Matt) L Hyde 4842568996
14 US9898171015 2012 Sarah (Sally) Gaines McCoy 11434863691
15 US9898171015 2013 Sarah (Sally) Gaines McCoy 11434863691
16 US9898171015 2013 Ernest R Johnson 40425210975
17 US9898171015 2014 Sarah (Sally) Gaines McCoy 11434863691
18 US9898171015 2014 Ernest R Johnson 40425210975
19 US9898171015 2014 Travis D Smith 53006212569
20 US9898171015 2015 Sarah (Sally) Gaines McCoy 11434863691
21 US9898171015 2015 Ernest R Johnson 40425210975
22 US9898171015 2015 Travis D Smith 53006212569
23 US9898171015 2016 Sarah (Sally) Gaines McCoy 11434863691
24 US9898171015 2016 Ernest R Johnson 40425210975
25 US9898171015 2016 Travis D Smith 53006212569
26 US9898171015 2017 Sarah (Sally) Gaines McCoy 11434863691
27 US9898171015 2017 Ernest R Johnson 40425210975
28 US9898171015 2017 Travis D Smith 53006212569
Inspecting the original data manually shows that the board composition was identical only for 2007 and 2008, and for 2013 and 2014, so I want ONLY these observations.
The output of my code above does not match this.
The expected result is:
ISIN year DirectorName DirectorID
1 US9898171015 2007 James (Jim) M Weber 3581636766
2 US9898171015 2007 Matthew (Matt) L Hyde 4842568996
3 US9898171015 2007 David (Dave) M DeMattei 759047198
4 US9898171015 2008 James (Jim) M Weber 3581636766
5 US9898171015 2008 Matthew (Matt) L Hyde 4842568996
6 US9898171015 2008 David (Dave) M DeMattei 759047198
7 US9898171015 2013 Sarah (Sally) Gaines McCoy 11434863691
8 US9898171015 2013 Ernest R Johnson 40425210975
9 US9898171015 2013 Travis D Smith 53006212569
10 US9898171015 2014 Sarah (Sally) Gaines McCoy 11434863691
11 US9898171015 2014 Ernest R Johnson 40425210975
12 US9898171015 2014 Travis D Smith 53006212569
I appreciate your help.
Upvotes: 0
Views: 184
Reputation: 8936
This is verbose and likely inefficient, but it gets the job done using nested data frames.
library(dplyr)
library(purrr)
library(readr)
library(tidyr)
"ROW,ISIN,YEAR,DIRECTOR_NAME,DIRECTOR_ID
1,US9898171015,2006,Thomas (Tom) E Davin,2247441792
2,US9898171015,2006,Matthew (Matt) L Hyde,4842568996
3,US9898171015,2007,James (Jim) M Weber,3581636766
4,US9898171015,2007,Matthew (Matt) L Hyde,4842568996
5,US9898171015,2007,David (Dave) M DeMattei,759047198
6,US9898171015,2008,James (Jim) M Weber,3581636766
7,US9898171015,2008,Matthew (Matt) L Hyde,4842568996
8,US9898171015,2008,David (Dave) M DeMattei,759047198
9,US9898171015,2009,William (Bill) Milroy Barnum Jr,20462211719
10,US9898171015,2009,James (Jim) M Weber,3581636766
11,US9898171015,2009,Matthew (Matt) L Hyde,4842568996
12,US9898171015,2009,David (Dave) M DeMattei,759047198
13,US9898171015,2010,William (Bill) Milroy Barnum Jr,20462211719
14,US9898171015,2010,James (Jim) M Weber,3581636766
15,US9898171015,2010,Matthew (Matt) L Hyde,4842568996
16,US9898171015,2011,Sarah (Sally) Gaines McCoy,11434863691
17,US9898171015,2011,William (Bill) Milroy Barnum Jr,20462211719
18,US9898171015,2011,James (Jim) M Weber,3581636766
19,US9898171015,2011,Matthew (Matt) L Hyde,4842568996
20,US9898171015,2012,Sarah (Sally) Gaines McCoy,11434863691
21,US9898171015,2012,Ernest R Johnson,40425210975
22,US9898171015,2013,Sarah (Sally) Gaines McCoy,11434863691
23,US9898171015,2013,Ernest R Johnson,40425210975
24,US9898171015,2013,Travis D Smith,53006212569
25,US9898171015,2014,Sarah (Sally) Gaines McCoy,11434863691
26,US9898171015,2014,Ernest R Johnson,40425210975
27,US9898171015,2014,Travis D Smith,53006212569
28,US9898171015,2015,Kalen F Holmes,11051172801
29,US9898171015,2015,Sarah (Sally) Gaines McCoy,11434863691
30,US9898171015,2015,Ernest R Johnson,40425210975
31,US9898171015,2015,Travis D Smith,53006212569
32,US9898171015,2016,Sarah (Sally) Gaines McCoy,11434863691
33,US9898171015,2016,Ernest R Johnson,40425210975
34,US9898171015,2016,Travis D Smith,53006212569
35,US9898171015,2017,Sarah (Sally) Gaines McCoy,11434863691
36,US9898171015,2017,Scott Andrew Bailey,174000000000
37,US9898171015,2017,Ernest R Johnson,40425210975
38,US9898171015,2017,Travis D Smith,53006212569
" %>%
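  read_csv() %>%
  # Nest each company-year's rows into a single list-column cell.
  group_by(ISIN, YEAR) %>%
  nest(.key = "OTHER_DATA") %>%
  group_by(ISIN) %>%
  # Put the previous and next year's nested data frames alongside the current one.
  mutate(OTHER_DATA_LAG = lag(OTHER_DATA, 1),
         OTHER_DATA_LEAD = lead(OTHER_DATA, 1),
         # Keep a year if its director IDs match the previous or the next year.
         # all_equal() ignores row order by default; it is deprecated since dplyr 1.1.0.
         KEEP = pmap(list(OTHER_DATA_LAG, OTHER_DATA, OTHER_DATA_LEAD), function(x, y, z) {
           isTRUE(all_equal(x["DIRECTOR_ID"], y["DIRECTOR_ID"])) ||
             isTRUE(all_equal(y["DIRECTOR_ID"], z["DIRECTOR_ID"]))
         })) %>%
  filter(unlist(KEEP)) %>%
  select(-OTHER_DATA_LAG, -OTHER_DATA_LEAD, -KEEP) %>%
  unnest() %>%
  ungroup()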
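A more compact variant of the same idea, as a rough sketch rather than a drop-in replacement: collapse each company-year's board into an order-independent text key, compare that key with the adjacent year's, and join the matching years back onto the original rows. The toy data, the names boards, signatures, stable_years and BOARD, and the extra check that the two years are consecutive are assumptions made for illustration; they are not part of the code above.

```r
library(dplyr)

# Hypothetical toy data: one company, four years, shortened director IDs.
boards <- tibble::tribble(
  ~ISIN, ~YEAR, ~DIRECTOR_ID,
  "X",   2006,  1,
  "X",   2006,  2,
  "X",   2007,  2,
  "X",   2007,  3,
  "X",   2008,  3,
  "X",   2008,  2,
  "X",   2009,  2,
  "X",   2009,  3,
  "X",   2009,  4
)

# One row per company-year; sorting makes the key independent of row order.
signatures <- boards %>%
  group_by(ISIN, YEAR) %>%
  summarise(BOARD = paste(sort(unique(DIRECTOR_ID)), collapse = "|"),
            .groups = "drop")

# Flag years whose board equals that of the directly preceding or following year.
# The "difference of exactly one year" check is an added assumption.
stable_years <- signatures %>%
  arrange(ISIN, YEAR) %>%
  group_by(ISIN) %>%
  mutate(
    SAME_AS_PREV = YEAR - lag(YEAR) == 1 & BOARD == lag(BOARD),
    SAME_AS_NEXT = lead(YEAR) - YEAR == 1 & BOARD == lead(BOARD)
  ) %>%
  ungroup() %>%
  # The first and last year of each company produce NA; treat NA as "no match".
  filter(coalesce(SAME_AS_PREV, FALSE) | coalesce(SAME_AS_NEXT, FALSE)) %>%
  select(ISIN, YEAR)

# Keep only the original rows that belong to a flagged company-year.
result <- semi_join(boards, stable_years, by = c("ISIN", "YEAR"))
```

In this toy data only 2007 and 2008 share the same board, so result keeps the four rows for those two years. Working with one short string per company-year avoids comparing nested data frames, which may be easier to follow on larger data.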
Upvotes: 1
Reputation: 365
It seems you are trying to identify when a value repeats. You might expect
a <- c(1,2,2,3)
a == lag(a)
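to yield TRUE for the third element and FALSE elsewhere. With base R's lag (stats::lag) it doesn't, so what's going on? That function only shifts the time base of a series and leaves the values where they are.
The issue with lag is discussed in more detail in this blog post: https://heuristically.wordpress.com/2012/10/29/lag-function-for-data-frames/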
The blogpost has a more sophisticated version, but for your needs, the following might be sufficient:
mylag <- function(v) { c(NA, head(v, -1)) }
a == mylag(a)
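To make the difference concrete, here is a small sketch that compares the three variants side by side. It assumes dplyr is installed; the names base_cmp, dplyr_cmp and manual_cmp are made up for this illustration.

```r
a <- c(1, 2, 2, 3)

# stats::lag only shifts the time-series attribute; the values stay in place,
# so each element ends up being compared with itself.
base_cmp <- as.vector(a == stats::lag(a))

# dplyr::lag shifts the values by one position and pads the front with NA.
dplyr_cmp <- a == dplyr::lag(a)

# The hand-written helper from above behaves like dplyr::lag for a lag of one.
mylag <- function(v) c(NA, head(v, -1))
manual_cmp <- a == mylag(a)

print(base_cmp)
print(dplyr_cmp)
print(manual_cmp)
```

Note that once dplyr is attached, an unqualified lag() refers to dplyr::lag, so which behaviour you see depends on which packages are loaded. Either way, this comparison works row by row and does not compare whole boards between years.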
Upvotes: 0