Reputation: 2250
I have written a R code. Please let me know if there is a better way of doing things (which I am sure of, as my code sucks). The data has three rows, called SIC, FYEAR, and SALES. I want to choose only those entries from data where the number of rows for which SIC, and FYEAR are same is greater than 7. The data looks like this:
SIC FYEAR SALE
1 50 2003 651.958
2 50 2004 747.848
3 50 2005 897.284
4 50 2006 1061.169
5 50 2007 1384.919
6 50 2008 1423.976
7 50 2009 1352.151
8 50 2010 1775.782
9 50 2011 2074.498
10 50 2012 2167.100
11 36 2003 773.200
12 36 2004 784.300
13 36 2005 1169.200
14 36 2006 1281.900
15 36 2007 1322.200
16 36 2008 1456.400
17 36 2009 996.700
18 36 2010 1156.600
19 38 2003 18.044
20 38 2004 19.833 ...
Apply doing the following operation,
a <- ddply(data, .(SIC,FYEAR), function(x){length(x$SALE)});
a looks like
SIC FYEAR V1
1 1 2003 16
2 1 2004 13
3 1 2005 13
4 1 2006 11
5 1 2007 11
6 1 2008 12
7 1 2009 14
8 1 2010 17
9 1 2011 16
10 1 2012 20
11 1 2013 5
12 2 2003 2
13 2 2004 2
14 2 2005 2
15 2 2006 2
16 2 2007 3
17 2 2008 4
18 2 2009 4
19 2 2010 5
20 2 2011 3 ...
which says that in the original data there were 16 rows with SIC 1 and FYEAR 2003. Then I select those SIC, FYEAR pair for which V1 >=8.
a <- sqldf("select SIC, FYEAR from a where V1 >=8");
In the end, I merge data and a to get rows for only those SIC, FYEAR pair where #rows were greater or equal to 8. Following line does the trick.
data <- merge(a, data, by= c("SIC","FYEAR"));
data looks like this now:
SIC FYEAR V1 SALE
1 1 2003 16 15.426
2 1 2003 16 338.272
3 1 2003 16 8.433
4 1 2003 16 4936.000
5 1 2003 16 477.405
6 1 2003 16 129.781
7 1 2003 16 0.000
8 1 2003 16 2613.548
9 1 2003 16 26.356
10 1 2003 16 94.988
11 1 2003 16 96.452
12 1 2003 16 18.343
13 1 2003 16 4773.141
14 1 2003 16 281.276
15 1 2003 16 2486.800
16 1 2003 16 48.285
17 1 2004 13 23.624
18 1 2004 13 5457.000
19 1 2004 13 36.431
20 1 2004 13 74.045 ...
Is there a better way to do the same thing, which I have done. Thanks.
Upvotes: 2
Views: 215
Reputation: 89057
Since you are already using plyr
, you can just do:
ddply(data, .(SIC,FYEAR), function(x) if (nrow(x) > 7) x else NULL)
Upvotes: 2