Reputation: 1265
I have the following data frame. (Note: my real data has over 100 columns and hundreds of rows.)
word1 word2 word3 word4 word5 Score
1 1 1 1 1 10
1 2 3 4 5 16
2 1 0 1 2 13
1 1 1 1 1 15
1 2 3 4 5 16
2 1 0 1 2 18
1 1 1 1 1 10
1 2 3 4 5 16
2 1 0 1 2 13
1 1 1 1 1 15
1 2 3 4 5 16
2 1 0 1 2 18
1 1 1 1 1 10
1 2 3 4 5 16
2 1 0 1 2 13
1 1 1 1 1 15
1 2 3 4 5 16
2 1 0 1 2 18
This is a system of linear equations in many variables. I want to solve it and get the actual values of word1, word2, word3, word4, etc. Score is predicted by word1, word2, word3, etc.
I have used
lm(Score~., data=DF)
This returns a few coefficient values and NA for the rest. Is there a reason for the NA values, and is there an alternative approach? Many thanks in advance.
Upvotes: 0
Views: 230
Reputation: 6222
fit <- lm(Score ~ ., data = df)
fit
#Call:
#lm(formula = Score ~ ., data = df)
#Coefficients:
#(Intercept)        word1        word2        word3        word4        word5
#        6.0          3.0          3.5           NA           NA           NA
If this is what happens, it must be due to multicollinearity in your data. When the predictors are perfectly collinear, lm cannot give a unique solution unless it drops some of the variables, and it reports NA for the ones it drops.
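One way to see this directly is to compare the number of columns in the design matrix with its rank. The sketch below rebuilds the sample data from the question (the name df and the rep() construction are my own, chosen to reproduce the posted rows) and uses base R's model.matrix and qr.

```r
# Rebuild the 18 sample rows from the question.
# The predictors repeat the same 3 patterns; Score repeats with period 6.
df <- data.frame(
  word1 = rep(c(1, 1, 2), times = 6),
  word2 = rep(c(1, 2, 1), times = 6),
  word3 = rep(c(1, 3, 0), times = 6),
  word4 = rep(c(1, 4, 1), times = 6),
  word5 = rep(c(1, 5, 2), times = 6),
  Score = rep(c(10, 16, 13, 15, 16, 18), times = 3)
)

# Design matrix that lm(Score ~ ., data = df) works with (intercept + 5 words).
X <- model.matrix(Score ~ ., data = df)

ncol(X)      # 6 coefficients requested
qr(X)$rank   # 3 coefficients that can actually be identified

# Number of distinct predictor patterns in the sample.
nrow(unique(df[, names(df) != "Score"]))   # 3
```

With only three distinct predictor patterns in the sample, at most three coefficients can be identified, which matches the three non-NA estimates above.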
In your case, it is easy to see the multicollinearity; see below. The word2 and word4 pair is perfectly correlated. There are a few other high correlation coefficients, too. (NOTE: cor is not the best way to check for multicollinearity, as it only checks pairwise correlations.)
round(cor(df), 2)
#       word1 word2 word3 word4 word5 Score
# word1  1.00 -0.50 -0.76 -0.50 -0.28  0.23
# word2 -0.50  1.00  0.94  1.00  0.97  0.37
# word3 -0.76  0.94  1.00  0.94  0.84  0.19
# word4 -0.50  1.00  0.94  1.00  0.97  0.37
# word5 -0.28  0.97  0.84  0.97  1.00  0.47
# Score  0.23  0.37  0.19  0.37  0.47  1.00
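Because cor only looks at pairs, a check on the fitted model itself may be more informative. As a sketch, base R's alias() reports which dropped terms are exact linear combinations of the retained ones. Here df is rebuilt from the question's rows (my own construction), and the relationships listed in the comments were worked out by hand from those rows.

```r
# Rebuild the 18 sample rows from the question.
df <- data.frame(
  word1 = rep(c(1, 1, 2), times = 6),
  word2 = rep(c(1, 2, 1), times = 6),
  word3 = rep(c(1, 3, 0), times = 6),
  word4 = rep(c(1, 4, 1), times = 6),
  word5 = rep(c(1, 5, 2), times = 6),
  Score = rep(c(10, 16, 13, 15, 16, 18), times = 3)
)

fit <- lm(Score ~ ., data = df)

# Terms lm() could not estimate because they are exact linear
# combinations of the terms it kept.
alias(fit)$Complete

# Worked out by hand from the sample rows, the dependencies are:
#   word3 = -word1 + 2 * word2
#   word4 = -2 + 3 * word2
#   word5 = -4 + word1 + 4 * word2
all(df$word3 == -df$word1 + 2 * df$word2)       # TRUE
all(df$word4 == -2 + 3 * df$word2)              # TRUE
all(df$word5 == -4 + df$word1 + 4 * df$word2)   # TRUE
```

Since word3, word4 and word5 carry no information beyond word1 and word2 in this sample, the individual coefficients are not uniquely determined. Dropping the redundant columns before fitting gives the same fitted values, and which terms end up as NA depends on the column order.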
Upvotes: 2