Reputation: 13
I have a set of 500 firms with investment data over six years. I want to create a balanced panel of ONLY the firms that have data for all six years, but I can't seem to find an easy way to do this in R.
My data looks like:
I want to drop a firm for all years if it has missing values for just one year. I've tried different conditional subsetting commands but nothing is working for me. Ideas?
Upvotes: 0
Views: 71
Reputation: 160417
Three methods:
dat <- data.frame(
firm = c(1, 1, 1, 2, 2, 2),
year = c(1, 2, 3, 1, 2, 3),
val = c(1, 1, 1, 2, NA, NA)
)
do.call(
rbind.data.frame,
by(dat, dat$firm, function(x) if (!any(is.na(x$val))) x)
)
# year firm val
# 1.1 1 1 1
# 1.2 2 1 1
# 1.3 3 1 1
or
datrm <- dat[!complete.cases(dat),]
dat[ !dat$firm %in% datrm$firm, ]
data.table
library(data.table)
datDT <- as.data.table(dat)
datDT[, .SD[!any(is.na(val)),], by = "firm" ]
dplyr
library(dplyr)
dat %>%
group_by(firm) %>%
filter(!any(is.na(val))) %>%
ungroup()
Upvotes: 3