Reputation: 77

Selecting unique non-repeating values

I have some panel data from 2004-2007 which I would like to select according to unique values. To be more precise im trying to find out entry and exits of individual stores throughout the period. Data sample:

 store year    rev space  market
     1 2004 110000  1095     136
     1 2005 110000  1095     136
     1 2006 110000  1095     136
     1 2007 120000  1095     136
     2 2004  35000   800     136
     3 2004  45000  1000     136
     3 2005  45000  1000     136
     3 2006  45000  1000     136
     3 2007  45000  1000     136
     4 2005  17500   320     136
     4 2006  17500   320     136
     4 2007  17500   320     136
     5 2005  45000   580     191
     5 2006  45000   580     191
     5 2007  45000   580     191
     6 2004   7000   345     191
     6 2005   7000   345     191
     6 2006   7000   345     191
     7 2007  10000   500     191

So for instance I would like to find out how many stores have exited the market throughout the period, which should look like:

 store year   rev space  market
     2 2004 35000   800     136
     6 2006  7000   345     191

As well as how many stores have entered the market, which would imply:

 store year    rev space  market
     4 2005  17500   320     136
     5 2005  45000   580     191
     7 2007  10000   500     191

UPDATE: I didn't include that it also should assume incumbent stores, such as:

 store year    rev  space  market
     1 2004 110000   1095     136
     1 2005 110000   1095     136
     1 2006 110000   1095     136
     1 2007 120000   1095     136     
     3 2004  45000   1000     136
     3 2005  45000   1000     136
     3 2006  45000   1000     136
     3 2007  45000   1000     136

Since im, pretty new to R I've been struggling to do it right even on year-by-year basis. Any suggestions?

Upvotes: 2

Answers (2)

Alex Brown

Reputation: 42932

Using only base R functions, this is pretty simple:

> subset(aggregate(df["year"],df["store"],max),year!=2007)
  store year
2     2 2004
6     6 2006

and

> subset(aggregate(df["year"],df["store"],min),year!=2004)
  store year
4     4 2005
5     5 2005
7     7 2007

or using formula syntax:

> subset(aggregate(year~store,df,max),year!=2007)
  store year
2     2 2004
6     6 2006

and

> subset(aggregate(year~store,df,min),year!=2004)
  store year
4     4 2005
5     5 2005
7     7 2007

Update Getting all the columns isn't possible for aggregate, so we can use base 'by' instead. By isn't as clever at reassembling the array:

Filter(function(x)x$year!=2007,by(df,df$store,function(s)s[s$year==max(s$year),]))

$`2`
  store year   rev space market
5     2 2004 35000   800    136

$`6`
   store year  rev space market
18     6 2006 7000   345    191

So we need to take that step - let's build a little wrapper:

by2=function(x,c,...){Reduce(rbind,by(x,x[c],simplify=FALSE,...))}

And now use that instead:

> subset(by2(df,"store",function(s)s[s$year==max(s$year),]),year!=2007)
   store year   rev space market
5      2 2004 35000   800    136
18     6 2006  7000   345    191

We can further clarify this by creating a function for getting a row which has the stat (min or max) for a particular column:

statmatch=function(column,stat)function(df){df[df[column]==stat(df[column]),]}

> subset(by2(df,"store",statmatch("year",max)),year!=2007)
   store year   rev space market
5      2 2004 35000   800    136
18     6 2006  7000   345    191

Dplyr

Using all of these base functions which don't really resemble each other starts to get fiddly after a while, so it's a great idea to learn and use the excellent (and performant) dplyr package:

> df %>% group_by(store) %>%
         arrange(-year) %>% slice(1) %>%
         filter(year != 2007) %>% ungroup

Source: local data frame [2 x 5]

  store year   rev space market
1     2 2004 35000   800    136
2     6 2006  7000   345    191

and

> df %>% group_by(store) %>% 
         arrange(+year) %>% slice(1) %>% 
         filter(year != 2004) %>% ungroup

Source: local data frame [3 x 5]

  store year   rev space market
1     4 2005 17500   320    136
2     5 2005 45000   580    191
3     7 2007 10000   500    191

NB The ungroup is not strictly necessary here, but puts the table back in a default state for further calculations.

Upvotes: 3

Señor O

Reputation: 17432

Using the data.table package, if your data.frame is called df:

dt = data.table(df)
exit = dt[,list(ExitYear = max(year)),by=store]
exit = exit[ExitYear != 2007] #Or whatever the "current year" is for this table

enter = dt[,list(EntryYear = min(year)),by=store]
enter = enter[EntryYear != 2003]

UPDATE

To get all columns instead of just the year and store, you can do:

exit = dt[,.SD[year == max(year)], by=store]
exit[year != 2007]
   store year   rev space market
1:     2 2004 35000   800    136
2:     6 2006  7000   345    191

Upvotes: 5

Selecting unique non-repeating values

Answers (2)

Dplyr

Related Questions