I am reproducing some Stata code in R and I would like to perform a multinomial logistic regression with the mlogit function, from the package of the same name (I know that there is a multinom function in nnet, but I don't want to use that one).
My problem is that, to use mlogit, I need my data to be formatted using mlogit.data, and I can't figure out how to format it properly. Comparing my data to the data used in the examples in the documentation and in this question, I realize that it is not in the same form.
My data looks like this:
df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
   ID type age dum1 dum2
1   1    2  28    1    1
2   2    3  31    0    0
3   3    4  12    0    1
4   4    2   1    0    1
5   5    1  49    0    0
6   6    1  80    1    0
7   7    4  36    0    1
8   8    1  53    1    0
9   9    3  22    1    1
10 10    2  10    0    0
whereas the data they use is like:
         key altkey    A      B   C D
1  201005131      1  2.6 118.17 117 0
2  201005131      2  1.4 117.11 115 0
3  201005131      3  1.1 117.38 122 1
4  201005131      4 24.6     NA 122 0
5  201005131      5 48.6  91.90 122 0
6  201005131      6 59.8     NA 122 0
7  201005132      1 20.2 118.23 113 0
8  201005132      2  2.5 123.67 120 1
9  201005132      3  7.4 116.30 120 0
10 201005132      4  2.8 118.86 120 0
11 201005132      5  6.9 124.72 120 0
12 201005132      6  2.5 123.81 120 0
As you can see, in their case there is a column altkey that lists every alternative for each key, and there is also a column D showing which alternative is chosen by the person.
However, I only have one column (type), which shows the choice of the individual but does not show the other alternatives or the values of the other variables for each of these alternatives. When I try to apply mlogit, I get:
library(mlogit)
mlogit(type ~ age + dum1 + dum2, df)
Error in data.frame(lapply(index, function(x) x[drop = TRUE]), row.names = rownames(mydata)) : row names supplied are of the wrong length
Therefore, how can I format my data so that it corresponds to the type of data mlogit requires?
Edit: following the advice of @edsandorf, I modified my data frame and mlogit.data now works, but all the other explanatory variables have the same value for each alternative. Should I set these variables to 0 in the rows where the chosen alternative is 0 or FALSE? (In fact, can somebody show me the procedure from where I am to the results of mlogit? I don't see where I'm going wrong in the estimation.)
The data I show here (df) is not my true data. However, it has exactly the same form: a column with the chosen alternative (type), columns with dummies, age, etc.
Here's what I've done so far (I did not set the alternatives to 0):
# create a data frame with all alternatives for each ID
qqch <- data.frame(ID = rep(df$ID, each = 4),
                   choice = rep(1:4, 10))
# merge both data frames
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type as 1 for the chosen alternative and 0 otherwise
for (i in seq_len(nrow(df2))) {
  df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
    ID choice  type age dum1 dum2
1.1  1      1 FALSE  28    1    1
1.2  1      2  TRUE  28    1    1
1.3  1      3 FALSE  28    1    1
1.4  1      4 FALSE  28    1    1
2.1  2      1 FALSE  31    0    0
2.2  2      2 FALSE  31    0    0
If I run:
mlogit(type ~ age + dum1 + dum2, df3)
I get the error:
Error in solve.default(H, g[!fixed]) : system is computationally singular: reciprocal condition number
Upvotes: 2
Views: 2110
Reputation: 767
Your data doesn't lend itself well to estimation with an MNL model unless we make more assumptions. In general, since all your variables are individual-specific and do not vary across alternatives (types), the model cannot be identified: all of your individual-specific characteristics will drop out unless we treat them as alternative-specific. By the sounds of it, each professional program carries meaning in and of itself. In that case, we could estimate the MNL model using constants only, where each constant captures everything about the program that makes an individual choose it.
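To see why such variables drop out, here is a minimal base R sketch. The helper choice_prob and all coefficient values are my own illustrative assumptions, not estimates from your data. With one generic coefficient, a variable that is identical across alternatives adds the same amount to every utility, so the choice probabilities do not depend on the coefficient at all, the likelihood is flat, and the Hessian is singular.

```r
# Softmax: turns a vector of utilities (one per alternative) into choice probabilities
choice_prob <- function(v) exp(v) / sum(exp(v))

age <- 28  # one individual's age; identical for all 4 alternatives

# Generic coefficient: every alternative gets utility beta * age.
# Two very different (arbitrary) betas give exactly the same probabilities.
p_small <- choice_prob(rep(0.01 * age, 4))
p_large <- choice_prob(rep(0.50 * age, 4))
print(p_small)
print(p_large)

# Alternative-specific coefficients (arbitrary values, alternative 4 fixed at 0
# as the reference): now age shifts the utilities differently, so it matters.
beta_alt <- c(0.02, -0.01, 0.03, 0)
p_alt <- choice_prob(beta_alt * age)
print(p_alt)
```

Both generic-coefficient vectors come out as equal shares for the four alternatives, which is why the optimiser cannot pin down a value for beta.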
library(mlogit)
df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
Now, just to be on the safe side, I create dummy variables for each of the programs. type_1 refers to program 1, type_2 to program 2, etc.
qqch <- data.frame(ID = rep(df$ID, each = 4),
                   choice = rep(1:4, 10))
# merge both data frames
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type as 1 for the chosen alternative and 0 otherwise
for (i in seq_len(nrow(df2))) {
  df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# Add alternative specific variables (here only constants)
df2$type_1 <- ifelse(df2$choice == 1, 1, 0)
df2$type_2 <- ifelse(df2$choice == 2, 1, 0)
df2$type_3 <- ifelse(df2$choice == 3, 1, 0)
df2$type_4 <- ifelse(df2$choice == 4, 1, 0)
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
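The row-by-row loop above works but gets slow on large data. As a rough sketch, the same long-format table, including the alternative-specific constants, could be built with vectorised base R. The names alts and long are my own, and the result would still need to go through mlogit.data as above.

```r
df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))

# All alternatives observed in the data (here 1 to 4)
alts <- sort(unique(df$type))

# One row per ID x alternative, with the individual's variables attached
long <- merge(expand.grid(ID = df$ID, choice = alts), df, by = "ID")
long <- long[order(long$ID, long$choice), ]
rownames(long) <- NULL

# Recode type in one vectorised step: 1 for the chosen alternative, 0 otherwise
long$type <- as.integer(long$type == long$choice)

# Alternative-specific constants type_1 ... type_4 (loops over alternatives, not rows)
for (a in alts) {
  long[[paste0("type_", a)]] <- as.integer(long$choice == a)
}

head(long, 4)
```

Each ID should end up with exactly one row where type is 1, which is a quick sanity check before estimation.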
Now we can run the model. I include the dummies for each of the alternatives, keeping alternative 4 as my reference level. Only J-1 constants are identified, where J is the number of alternatives. In the second part of the formula (after the |), I use -1 to remove the alternative-specific constants that the model would otherwise have created, and I add your individual-specific variables so that each of them gets alternative-specific coefficients. Note that this only makes sense if your alternatives (programs) carry meaning and are not generic.
model <- mlogit(type ~ type_1 + type_2 + type_3 | -1 + age + dum1 + dum2,
                reflevel = 4, data = df3)
summary(model)
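As far as I can tell, the same specification can be written more compactly by letting mlogit create the constants itself and by passing the one-row-per-individual data with shape = "wide". The sketch below uses simulated data (the sim data frame, its sample size and its random choices are assumptions of mine, not your data), because with only 10 observations the estimates are unlikely to be stable.

```r
library(mlogit)

# Simulated stand-in for the real data: one row per individual, choice in `type`
set.seed(1)
n <- 300
sim <- data.frame(ID   = seq_len(n),
                  age  = round(runif(n, 18, 80)),
                  dum1 = rbinom(n, 1, 0.5),
                  dum2 = rbinom(n, 1, 0.5))
sim$type <- factor(sample(1:4, n, replace = TRUE))

# Wide shape: mlogit.data expands each individual to one row per alternative
sim_long <- mlogit.data(sim, choice = "type", shape = "wide")

# Nothing alternative-specific before the |; individual-specific variables
# after it get one coefficient per non-reference alternative, plus intercepts
fit <- mlogit(type ~ 0 | age + dum1 + dum2, data = sim_long, reflevel = "4")
summary(fit)
```

With 4 alternatives this should report 3 intercepts and 3 coefficients for each of age, dum1 and dum2, all relative to alternative 4.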
Upvotes: 2