Reputation: 77
I have some panel data from 2004-2007 which I would like to select according to unique values. To be more precise im trying to find out entry and exits of individual stores throughout the period. Data sample:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
2 2004 35000 800 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
4 2005 17500 320 136
4 2006 17500 320 136
4 2007 17500 320 136
5 2005 45000 580 191
5 2006 45000 580 191
5 2007 45000 580 191
6 2004 7000 345 191
6 2005 7000 345 191
6 2006 7000 345 191
7 2007 10000 500 191
So for instance I would like to find out how many stores have exited the market throughout the period, which should look like:
store year rev space market
2 2004 35000 800 136
6 2006 7000 345 191
As well as how many stores have entered the market, which would imply:
store year rev space market
4 2005 17500 320 136
5 2005 45000 580 191
7 2007 10000 500 191
UPDATE: I didn't include that it also should assume incumbent stores, such as:
store year rev space market
1 2004 110000 1095 136
1 2005 110000 1095 136
1 2006 110000 1095 136
1 2007 120000 1095 136
3 2004 45000 1000 136
3 2005 45000 1000 136
3 2006 45000 1000 136
3 2007 45000 1000 136
Since im, pretty new to R I've been struggling to do it right even on year-by-year basis. Any suggestions?
Upvotes: 2
Views: 179
Reputation: 42872
Using only base R functions, this is pretty simple:
> subset(aggregate(df["year"],df["store"],max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(df["year"],df["store"],min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
or using formula syntax:
> subset(aggregate(year~store,df,max),year!=2007)
store year
2 2 2004
6 6 2006
and
> subset(aggregate(year~store,df,min),year!=2004)
store year
4 4 2005
5 5 2005
7 7 2007
Update Getting all the columns isn't possible for aggregate, so we can use base 'by' instead. By isn't as clever at reassembling the array:
Filter(function(x)x$year!=2007,by(df,df$store,function(s)s[s$year==max(s$year),]))
$`2`
store year rev space market
5 2 2004 35000 800 136
$`6`
store year rev space market
18 6 2006 7000 345 191
So we need to take that step - let's build a little wrapper:
by2=function(x,c,...){Reduce(rbind,by(x,x[c],simplify=FALSE,...))}
And now use that instead:
> subset(by2(df,"store",function(s)s[s$year==max(s$year),]),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
We can further clarify this by creating a function for getting a row which has the stat (min or max) for a particular column:
statmatch=function(column,stat)function(df){df[df[column]==stat(df[column]),]}
> subset(by2(df,"store",statmatch("year",max)),year!=2007)
store year rev space market
5 2 2004 35000 800 136
18 6 2006 7000 345 191
Using all of these base functions which don't really resemble each other starts to get fiddly after a while, so it's a great idea to learn and use the excellent (and performant) dplyr package:
> df %>% group_by(store) %>%
arrange(-year) %>% slice(1) %>%
filter(year != 2007) %>% ungroup
Source: local data frame [2 x 5]
store year rev space market
1 2 2004 35000 800 136
2 6 2006 7000 345 191
and
> df %>% group_by(store) %>%
arrange(+year) %>% slice(1) %>%
filter(year != 2004) %>% ungroup
Source: local data frame [3 x 5]
store year rev space market
1 4 2005 17500 320 136
2 5 2005 45000 580 191
3 7 2007 10000 500 191
NB The ungroup is not strictly necessary here, but puts the table back in a default state for further calculations.
Upvotes: 3
Reputation: 17412
Using the data.table
package, if your data.frame
is called df
:
dt = data.table(df)
exit = dt[,list(ExitYear = max(year)),by=store]
exit = exit[ExitYear != 2007] #Or whatever the "current year" is for this table
enter = dt[,list(EntryYear = min(year)),by=store]
enter = enter[EntryYear != 2003]
UPDATE
To get all columns instead of just the year and store, you can do:
exit = dt[,.SD[year == max(year)], by=store]
exit[year != 2007]
store year rev space market
1: 2 2004 35000 800 136
2: 6 2006 7000 345 191
Upvotes: 5