krkc_bhdr
krkc_bhdr

Reputation: 61

subsetting panel data that has the complete time dimension

I have a panel data and as you might notice some of the individuals have missing observations in certain times. For instance "C" is missing data point for 2001 and "D" for 2002 and 2003.

> mydata
    id year sales profit
 1:  A 2000  2000    200
 2:  A 2001  2050    245
 3:  A 2002  2100    290
 4:  A 2003  2150    335
 5:  B 2000  2200    380
 6:  B 2001  2250    425
 7:  B 2002  2300    470
 8:  B 2003  2350    515
 9:  C 2000  2400    560
10:  C 2002  2500    650
11:  C 2003  2550    695
12:  D 2000  2600    740
13:  D 2001  2650    785

I tried something like below

subset(mydata, year==c(2000:2003)

the result is as presented below.

   id year sales profit
1:  A 2000  2000    200
2:  A 2001  2050    245
3:  A 2002  2100    290
4:  A 2003  2150    335
5:  B 2000  2200    380
6:  B 2001  2250    425
7:  B 2002  2300    470
8:  B 2003  2350    515
9:  C 2000  2400    560
Warning message:
In year == c(2000:2003) :
  longer object length is not a multiple of shorter object length

What I need is the data that includes entities with full period, starting from beginning, 2000, to the end, 2003. In this case it would be like this.

   id year sales profit
1:  A 2000  2000    200
2:  A 2001  2050    245
3:  A 2002  2100    290
4:  A 2003  2150    335
5:  B 2000  2200    380
6:  B 2001  2250    425
7:  B 2002  2300    470
8:  B 2003  2350    515

Thank you for your time and answer in advance, but I would really appreciate if the answer is a little simpler as I am highly unexperinced and has just begun to learn about R.

Upvotes: 4

Views: 71

Answers (3)

Uwe
Uwe

Reputation: 42544

For the sake of completeness, here is also a data.table solution using a join:

mydata[mydata[, uniqueN(year), by = id][V1 == 4L, .(id)], on = "id"]
   id year sales profit
1:  A 2000  2000    200
2:  A 2001  2050    245
3:  A 2002  2100    290
4:  A 2003  2150    335
5:  B 2000  2200    380
6:  B 2001  2250    425
7:  B 2002  2300    470
8:  B 2003  2350    515

Upvotes: 0

Parfait
Parfait

Reputation: 107567

Consider base R's ave to count id groups and keep only records equal to year length of 4:

data

txt = '    id year sales profit
A 2000  2000    200
A 2001  2050    245
A 2002  2100    290
A 2003  2150    335
B 2000  2200    380
B 2001  2250    425
B 2002  2300    470
B 2003  2350    515
C 2000  2400    560
C 2002  2500    650
C 2003  2550    695
D 2000  2600    740
D 2001  2650    785'

df <- read.table(text=txt, header=TRUE)

code

df$grp_cnt <- ave(df$year, df$id, FUN=length)

df <- transform(subset(df, df$grp_cnt == 4), grp_cnt = NULL)

df    
#   id year sales profit
# 1  A 2000  2000    200
# 2  A 2001  2050    245
# 3  A 2002  2100    290
# 4  A 2003  2150    335
# 5  B 2000  2200    380
# 6  B 2001  2250    425
# 7  B 2002  2300    470
# 8  B 2003  2350    515

Upvotes: 0

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193497

You can try something like:

library(data.table)
mydata[, ind := all(2000:2003 %in% year), id][(ind)]
#    id year sales profit  ind
# 1:  A 2000  2000    200 TRUE
# 2:  A 2001  2050    245 TRUE
# 3:  A 2002  2100    290 TRUE
# 4:  A 2003  2150    335 TRUE
# 5:  B 2000  2200    380 TRUE
# 6:  B 2001  2250    425 TRUE
# 7:  B 2002  2300    470 TRUE
# 8:  B 2003  2350    515 TRUE

With "tidyverse":

library(tidyverse)
mydata %>% 
  group_by(id) %>% 
  filter(all(2000:2003 %in% year))

Sample data (which is how you should share it in the future):

mydata <- structure(list(id = c("A", "A", "A", "A", "B", "B", "B", "B", 
    "C", "C", "C", "D", "D"), year = c(2000L, 2001L, 2002L, 2003L, 
    2000L, 2001L, 2002L, 2003L, 2000L, 2002L, 2003L, 2000L, 2001L
    ), sales = c(2000L, 2050L, 2100L, 2150L, 2200L, 2250L, 2300L, 
    2350L, 2400L, 2500L, 2550L, 2600L, 2650L), profit = c(200L, 245L, 
    290L, 335L, 380L, 425L, 470L, 515L, 560L, 650L, 695L, 740L, 785L
    )), .Names = c("id", "year", "sales", "profit"), row.names = c(NA, 
    13L), class = c("data.table", "data.frame"))

Upvotes: 2

Related Questions