Reputation: 1
I have a csv file with 3 columns year
,netInc
,Organization
. I want to clean the data and consider only the Organization whose values are present in all the financial years . For instance I want to consider all years between 2000 and 2003 (count = 4) and don't want data of B and D as it has data for few years only . So how should I do this to get new dataframe having data for 2000 to 2003 and having 50k observations?
Year netInc Org
2000 10 A
2001 20 A
2002 30 A
2003 40 A
2000 5 B
2001 7 B
2000 10 C
2001 20 C
2002 30 C
2003 40 C
2010 10 D
Upvotes: 0
Views: 96
Reputation: 887851
We can use a group by filter
i.e. grouped by 'Org', filter
only those 'Org' having all
the 'Year' from 2000 to 2003
library(dplyr)
df1 %>%
group_by(Org) %>%
filter(all(2000:2003 %in% Year)) %>%
ungroup
-output
# A tibble: 8 x 3
Year netInc Org
<int> <int> <chr>
1 2000 10 A
2 2001 20 A
3 2002 30 A
4 2003 40 A
5 2000 10 C
6 2001 20 C
7 2002 30 C
8 2003 40 C
df1 <- structure(list(Year = c(2000L, 2001L, 2002L, 2003L, 2000L, 2001L,
2000L, 2001L, 2002L, 2003L, 2010L), netInc = c(10L, 20L, 30L,
40L, 5L, 7L, 10L, 20L, 30L, 40L, 10L), Org = c("A", "A", "A",
"A", "B", "B", "C", "C", "C", "C", "D")),
class = "data.frame", row.names = c(NA,
-11L))
Upvotes: 3