Reputation: 13
So I have a data set that is based on HR data training which asks tech and common questions.
The rows represent an employee and the columns represent the score they got on each question. The columns also include demographic data. I only want to see the row total of the tech and common questions though and not include the demographic data.
techs<-grep("^T",rownames(dat))
commons<-grep("^C",rownames(dat))
I used this to try to group the columns together but when I do:
total<-rowsum(commons,techs)
and try to put it in a linear regression:
Mod1Train<-lm(total~.,data=dat[Train,])
it says that there are different variable lengths.
I'm a super newbie to R, so sorry in advance if I'm really off.
Upvotes: 1
Views: 91
Reputation: 3923
in the future it would be ever so helpful if you provided a sample of your data. It's hard for us to help when we're guessing about that. Please see this link https://stackoverflow.com/help/minimal-reproducible-example.
Having said that LOL and realizing you're new I'll take a guess...
Let's make pretend data that I imagine is a smaller imaginary version of yours...
set.seed(2020)
emplid <- 1:10
gender <- sample(c("Male", "Female"), size = 10, replace = TRUE)
Tech1 <- sample(10:20, size = 10, replace = TRUE)
Tech2 <- sample(10:20, size = 10, replace = TRUE)
Tech3 <- sample(10:20, size = 10, replace = TRUE)
Common1 <- sample(10:20, size = 10, replace = TRUE)
Common2 <- sample(10:20, size = 10, replace = TRUE)
Common3 <- sample(10:20, size = 10, replace = TRUE)
Kathryn <- data.frame(emplid, gender, Tech1, Tech2, Tech3, Common1, Common2, Common3)
Kathryn
#> emplid gender Tech1 Tech2 Tech3 Common1 Common2 Common3
#> 1 1 Female 10 17 15 18 17 15
#> 2 2 Female 17 13 11 20 11 13
#> 3 3 Male 17 11 19 18 10 12
#> 4 4 Female 19 16 15 14 15 16
#> 5 5 Female 11 13 20 20 16 13
#> 6 6 Male 15 11 17 19 17 13
#> 7 7 Male 11 13 11 15 14 11
#> 8 8 Female 12 14 10 11 17 19
#> 9 9 Female 11 13 15 18 11 10
#> 10 10 Female 17 20 12 12 14 15
If you're new may want to invest some time learning the tidyverse
which could make this simple like here Efficiently sum across multiple columns in R
Per your note in the comments, you have a pattern we can match for summing questions. You were close with your attempt at grep
but we want the values back so we need value = TRUE
which we'll store and make use of.
techqs <- grep(x = names(Kathryn), pattern = "^Tech", value = TRUE)
commonqs <- grep(x = names(Kathryn), pattern = "^Common", value = TRUE)
Kathryn$TechScores <- rowSums(Kathryn[,techqs])
Kathryn$CommonScores <- rowSums(Kathryn[,commonqs])
### Commented out how to do it manually.
# Kathryn$TechScores <- rowSums(Kathryn[,c("TQ1", "TQ2", "TQ3")])
# Kathryn$CommonScores <- rowSums(Kathryn[,c("CQ1", "CQ2", "CQ3")])
Kathryn$TotalScore <- Kathryn$TechScores + Kathryn$CommonScores
Now to regress which is where the statistical problem comes in. Are you really trying to predict the total score from the components??? That's not hard in r
but it leads to silly answers.
Kathryn_model <- lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
summary(Kathryn_model)
#> Warning in summary.lm(Kathryn_model): essentially perfect fit: summary may be
#> unreliable
#>
#> Call:
#> lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.165e-14 -1.905e-15 9.290e-16 8.590e-15 1.183e-14
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 8.089e-14 6.345e-14 1.275e+00 0.243
#> TechScores 1.000e+00 9.344e-16 1.070e+15 <2e-16 ***
#> CommonScores 1.000e+00 1.130e-15 8.853e+14 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.43e-14 on 7 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 9.875e+29 on 2 and 7 DF, p-value: < 2.2e-16
Upvotes: 1
Reputation: 755
I don't understand your code and what you search for
In lm, there should be something like
lm(y~x,data=adataframe)
"adataframe" is the eventual dataframe/matrix where lm seeks both the response and the input variable,named "y" and "x" here. It is optional. If not found, y and x are seeked in the Global Env as if the columns names are not found in data, they are seeked in the Global environment. It is sometimes better however, to have such a matrix-like object, to avoid common errors.
So if you want to use lm, maybe you should first try to obtain 2 vectors, one for x and one for y, have them in a data.frame with 2 columns (x and y), and call the code above, if I have correctly understood
Note : if you want to remove the constant, use then
lm(y~x+0,data=adataframe)
Upvotes: 0