How to total rows with only certain columns

Question

So I have a data set that is based on HR data training which asks tech and common questions.

The rows represent an employee and the columns represent the score they got on each question. The columns also include demographic data. I only want to see the row total of the tech and common questions though and not include the demographic data.

techs<-grep("^T",rownames(dat))
commons<-grep("^C",rownames(dat))

I used this to try to group the columns together but when I do:

total<-rowsum(commons,techs)

and try to put it in a linear regression:

Mod1Train<-lm(total~.,data=dat[Train,])

it says that there are different variable lengths.

I'm a super newbie to R, so sorry in advance if I'm really off.

Chuck P · Accepted Answer

in the future it would be ever so helpful if you provided a sample of your data. It's hard for us to help when we're guessing about that. Please see this link https://stackoverflow.com/help/minimal-reproducible-example.

Having said that LOL and realizing you're new I'll take a guess...

Let's make pretend data that I imagine is a smaller imaginary version of yours...

set.seed(2020)
emplid <- 1:10
gender <- sample(c("Male", "Female"), size = 10, replace = TRUE)
Tech1 <- sample(10:20, size = 10, replace = TRUE)
Tech2 <- sample(10:20, size = 10, replace = TRUE)
Tech3 <- sample(10:20, size = 10, replace = TRUE)
Common1 <- sample(10:20, size = 10, replace = TRUE)
Common2 <- sample(10:20, size = 10, replace = TRUE)
Common3 <- sample(10:20, size = 10, replace = TRUE)
Kathryn <-  data.frame(emplid, gender, Tech1, Tech2, Tech3, Common1, Common2, Common3)
Kathryn
#>    emplid gender Tech1 Tech2 Tech3 Common1 Common2 Common3
#> 1       1 Female    10    17    15      18      17      15
#> 2       2 Female    17    13    11      20      11      13
#> 3       3   Male    17    11    19      18      10      12
#> 4       4 Female    19    16    15      14      15      16
#> 5       5 Female    11    13    20      20      16      13
#> 6       6   Male    15    11    17      19      17      13
#> 7       7   Male    11    13    11      15      14      11
#> 8       8 Female    12    14    10      11      17      19
#> 9       9 Female    11    13    15      18      11      10
#> 10     10 Female    17    20    12      12      14      15

If you're new may want to invest some time learning the tidyverse which could make this simple like here Efficiently sum across multiple columns in R

Per your note in the comments, you have a pattern we can match for summing questions. You were close with your attempt at grep but we want the values back so we need value = TRUE which we'll store and make use of.

techqs <- grep(x = names(Kathryn), pattern = "^Tech", value = TRUE)
commonqs <- grep(x = names(Kathryn), pattern = "^Common", value = TRUE)
Kathryn$TechScores <- rowSums(Kathryn[,techqs])
Kathryn$CommonScores <- rowSums(Kathryn[,commonqs])

### Commented out how to do it manually.
# Kathryn$TechScores <- rowSums(Kathryn[,c("TQ1", "TQ2", "TQ3")])
# Kathryn$CommonScores <- rowSums(Kathryn[,c("CQ1", "CQ2", "CQ3")])

Kathryn$TotalScore <- Kathryn$TechScores + Kathryn$CommonScores

Now to regress which is where the statistical problem comes in. Are you really trying to predict the total score from the components??? That's not hard in r but it leads to silly answers.

Kathryn_model <- lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
summary(Kathryn_model)
#> Warning in summary.lm(Kathryn_model): essentially perfect fit: summary may be
#> unreliable
#> 
#> Call:
#> lm(formula = TotalScore ~ TechScores + CommonScores, data = Kathryn)
#> 
#> Residuals:
#>        Min         1Q     Median         3Q        Max 
#> -3.165e-14 -1.905e-15  9.290e-16  8.590e-15  1.183e-14 
#> 
#> Coefficients:
#>               Estimate Std. Error   t value Pr(>|t|)    
#> (Intercept)  8.089e-14  6.345e-14 1.275e+00    0.243    
#> TechScores   1.000e+00  9.344e-16 1.070e+15   <2e-16 ***
#> CommonScores 1.000e+00  1.130e-15 8.853e+14   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.43e-14 on 7 degrees of freedom
#> Multiple R-squared:      1,  Adjusted R-squared:      1 
#> F-statistic: 9.875e+29 on 2 and 7 DF,  p-value: < 2.2e-16

How to total rows with only certain columns

Answers (2)

Related Questions