Reputation: 12178
This is a repost from stats.stackexchange where I did not get a satisfactory response. I have two datasets, the first on schools, and the second lists students in each school who have failed in a standardized test (emphasis intentional). Fake datasets can be generated by (thanks to Tharen):
#random school data for 30 schools
schools.num = 30
schools.data = data.frame(school_id=seq(1,schools.num)
,tot_white=sample(100:300,schools.num,TRUE)
,tot_black=sample(100:300,schools.num,TRUE)
,tot_asian=sample(100:300,schools.num,TRUE)
,school_rev=sample(4e6:6e6,schools.num,TRUE)
)
#total students in each school
schools.data$tot_students = schools.data$tot_white + schools.data$tot_black + schools.data$tot_asian
#sum of all students all schools
tot_students = sum(schools.data$tot_white, schools.data$tot_black, schools.data$tot_asian)
#generate some random failing students
fail.num = as.integer(tot_students * 0.05)
students = data.frame(student_id=sample(seq(1:tot_students), fail.num, FALSE)
,school_id=sample(1:schools.num, fail.num, TRUE)
,race=sample(c('white', 'black', 'asian'), fail.num, TRUE)
)
I am trying to estimate P(Fail=1 | Student Race, School Revenue). If I run a multinomial discrete choice model on the student dataset, I shall clearly be estimating P(Race | Fail=1). I obviously have to estimate the inverse of this. Since all the pieces of information are available in the two datasets (P(Fail), P(Race), Revenue), I see no reason why this can't be done. But I am stumped as to actually how to implement in R. Any pointer would be much appreciated. Thanks.
Upvotes: 1
Views: 2197
Reputation: 32351
It will be easier if you have a single data.frame.
library(reshape2)
library(plyr)
d1 <- ddply(
students,
c("school_id", "race"),
summarize,
fail=length(student_id)
)
d2 <- with( schools.data, data.frame(
school_id = school_id,
white = tot_white,
black = tot_black,
asian = tot_asian,
school_rev = school_rev
) )
d2 <- melt(d2,
id.vars=c("school_id", "school_rev"),
variable.name="race",
value.name="total"
)
d <- merge( d1, d2, by=c("school_id", "race") )
d$pass <- d$total - d$fail
You can then look at the data
library(lattice)
xyplot( d$fail / d$total ~ school_rev | race, data=d )
Or compute anything you want.
r <- glm(
cbind(fail,pass) ~ race + school_rev,
data=d,
family=binomial() # Logistic regression (not bayesian)
)
summary(r)
(EDIT) If you have more information about the failed students, but only aggregated data for the passed ones, you can recreate a complete dataset as follows.
# Unique student_id for the passed students
d3 <- ddply( d,
c("school_id", "race"),
summarize, student_id=1:pass
)
d3$student_id <- - seq_len(nrow(d3))
# All students
d3$result <- "pass"
students$result <- "fail"
d3 <- merge( # rather than rbind, in case there are more columns
d3, students,
by=c("student_id", "school_id", "race", "result"),
all=TRUE
)
# Students and schools in a single data.frame
d3 <- merge( d3, schools.data, by="school_id", all=TRUE )
# Check that the results did not change
r <- glm(
(result=="fail") ~ race + school_rev,
data=d3,
family=binomial()
)
summary(r)
Upvotes: 1
Reputation: 18487
You'll need a dataset with information on all students. Both failed and passed.
schools.num = 30
schools.data = data.frame(school_id=seq(1,schools.num)
,tot_white=sample(100:300,schools.num,TRUE)
,tot_black=sample(100:300,schools.num,TRUE)
,tot_asian=sample(100:300,schools.num,TRUE)
,school_rev=sample(4e6:6e6,schools.num,TRUE)
)
library(plyr)
fail_ratio <- 0.05
dataset <- ddply(schools.data, .(school_id, school_rev), function(x){
data.frame(Fail = rbinom(sum(x$tot_white, x$tot_asian, x$tot_black), size = 1, prob = fail_ratio), Race = c(rep("white", x$tot_white), rep("asian", x$tot_asian), rep("black", x$tot_black)))
})
dataset$Race <- factor(dataset$Race)
Then you can use glmer() for the lme4 package for a frequentist approach.
library(lme4)
glmer(Fail ~ school_rev + Race + (1|school_id), data = dataset, family = binomial)
Have a look at the MCMCglmm package if you need Bayesian estimates.
Upvotes: 0