Reputation: 13
I am new to R so this may seem like a basic question; I am trying to estimate a probit regression of being Employed given the individual is a Male. I think I have the probit model correct but I am unable to plot it. Below is the first 10 rows from my dataset which has 60,000 in total. As you can see I have created 5 dummy variables: 'Male', 'LeavingCert', 'Bachelors', 'Married', and 'Employed'; (Although the first 10 rows in the Employed column are 0, this is not the case for the full dataset. However; there are significantly more 0s than 1s and perhaps this is my issue?)
Top10 <- head(data,10)
Top10
# A tibble: 10 × 10
...1 SEX MARSTAT MAINSTAT EDUCLEV4 Male LeavingCert Bachelors Married Employed
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 2 6 9 0 0 0 1 0
2 2 2 1 2 9 0 0 0 0 0
3 3 2 1 2 9 0 0 0 0 0
4 4 2 2 2 9 0 0 0 1 0
5 5 2 1 5 6 0 0 1 0 0
6 6 1 2 3 9 1 0 0 1 0
7 7 1 2 2 9 1 0 0 1 0
8 8 2 3 2 9 0 0 0 0 0
9 9 2 2 2 9 0 0 0 1 0
10 10 1 1 4 9 1 0 0 0 0
For my 'Probit1' model, my Y is 'Employed' and my X is 'Male'. My code is as follows;
Probit1 <- glm(Employed ~ Male,
family = binomial(link = "probit"),
data = data)
summary(Probit1)
I have tried to plot this probit regression as follows;
# plot data
plot(x = data$Male,
y = data$Employed,
main = "Probit Model of the Probability of Employed, Given Male",
xlab = "Male",
ylab = "Employed",
pch = 20,
ylim = c(-0.4, 1.4),
cex.main = 0.85)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Empolyed")
text(2.5, -0.1, cex= 0.8, "Unemployed")
# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(Probit1, list(Male = x), type = "response")
lines(x, y, lwd = 1.5, col = "steelblue")
This is the plot I am getting and it does not seem correct?
Are my data the problem?
Any help is very much appreciated and if this is not possible, is there another plot I could do? Thanks in advance.
summary(Probit1)
> summary(Probit1)
Call:
glm(formula = Employed ~ Male, family = binomial(link = "probit"),
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.2777 -0.2777 -0.2742 -0.2742 2.5689
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.78787 0.01299 -137.67 <2e-16 ***
Male 0.01141 0.01871 0.61 0.542
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 19747 on 61939 degrees of freedom
Residual deviance: 19747 on 61938 degrees of freedom
AIC: 19751
Number of Fisher Scoring iterations: 5
table(data$Male, data$Employed)
> table(data$Male, data$Employed)
0 1
0 31165 1194
1 28462 1119
Upvotes: 1
Views: 741
Reputation: 4427
Thank you for adding the regression summary and the table. From the regression summary we can see, that the coefficient for male is really small (and not significant):
Male 0.01141 0.01871 0.61 0.542
That small coefficient means, that being male increases the chances of being employed only the tiniest bit so it makes sense, that we cannot see it in the plot, because the change is too small, the slope to little.
Looking at the table
> table(data$Male, data$Employed)
0 1
0 31165 1194
1 28462 1119
We can confirm, that there is no numerically or visually impressive increase:
1194/(1194+31165) = 3,7%
, 1119/(1119+28462) = 3,8%
.
You cannot expect to see a .1% increase in your plot.
Are my data the problem?
They are the reason. I don't know, whether they are a problem.
Upvotes: 1