Mudathir Bakhit

Reputation: 1

Different results with Shapiro Wilk test in R

I have a large dataset for an independent-samples t-test with two factors, one of which is Gender. I want to check the normality of each variable within each Gender group to decide the next step, so I adapted the following script that I found on this forum.

for (i in 9:ncol(AF)) {
  print(names(AF)[i]) 
  print(AF %>%
          group_by(Gender) %>%
          summarise(`W Statistic` = ifelse(sd(AF[, i])!=0,
                                           shapiro.test(AF[, i])$statistic,NA),
                    `p-value` = ifelse(sd(AF[, i])!=0,
                                       shapiro.test(AF[, i])$p.value,NA)))
}

The result for the first variable (R_44) was as follows:

## [1] "R_44"
## # A tibble: 2 × 3
##   Gender Statistic `p-value`
##   <fct>      <dbl>     <dbl>
## 1 F          0.560  9.31e-10
## 2 M          0.560  9.31e-10

I remembered checking the normality of this variable in JASP at the beginning of my work, and the result there was different:

## 1 F          0.465  1.559e -7
## 2 M          0.623  5.149e -6

I repeated the test in R for the same variable without the loop function as below:

shapiro.test(AF$R_44[AF$Gender == "F"])
shapiro.test(AF$R_44[AF$Gender == "M"])

The results were:

data:  AF$R_44[AF$Gender == "F"]
W = 0.46505, p-value = 1.559e-07

data:  AF$R_44[AF$Gender == "M"]
W = 0.62303, p-value = 5.149e-06

These match the JASP results. I therefore assume the first script above contains a mistake, but I cannot find it. Any help is appreciated!

Upvotes: 0

Views: 193

Answers (1)

Ronak Shah

Reputation: 388817

AF[, i] subsets the column from the entire data frame and ignores the grouping by Gender, so both groups receive the test result for the whole column. Use cur_data() to work with the data of the current group instead.
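To see the effect on a small scale, here is a minimal sketch with made-up data. The toy data frame and its score column are hypothetical and not taken from the question. It uses the .data pronoun instead of cur_data(), since dplyr 1.1.0 and later deprecate cur_data() in favour of pick(); both approaches restrict the computation to the current group.

```r
library(dplyr)

# Hypothetical toy data: two groups with clearly different spreads
toy <- data.frame(
  Gender = factor(rep(c("F", "M"), each = 5)),
  score  = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# toy[, "score"] always returns the whole column, so grouping is ignored
ungrouped <- toy %>%
  group_by(Gender) %>%
  summarise(sd_value = sd(toy[, "score"]))

# .data[["score"]] returns only the rows of the current group
grouped <- toy %>%
  group_by(Gender) %>%
  summarise(sd_value = sd(.data[["score"]]))

print(ungrouped)  # identical sd_value in both rows
print(grouped)    # one sd_value per group
```

The first summary repeats the same number for F and M, which is the pattern seen in the question's output.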

Also, since sd() returns a single value, it is better to use if/else than the vectorised ifelse().
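A short sketch of the difference, using an arbitrary numeric vector. The helper name safe_shapiro_p is hypothetical and only illustrates the zero-variance guard; it is not part of any package.

```r
x <- c(2.1, 3.4, 1.9, 5.6, 4.2)

# ifelse() returns a result shaped like its test, so a length-one
# test produces a length-one result even if `yes` is longer
from_ifelse <- ifelse(sd(x) != 0, x, NA)
length(from_ifelse)   # 1

# if/else checks one condition and returns the chosen branch unchanged
from_if <- if (sd(x) != 0) x else NA
length(from_if)       # 5

# Hypothetical helper: skip the test when the values are constant,
# because shapiro.test() stops with an error on identical values
safe_shapiro_p <- function(v) {
  if (sd(v) != 0) shapiro.test(v)$p.value else NA_real_
}

safe_shapiro_p(rep(3, 10))  # NA
```

In the loop this makes no difference to the numbers, because the test statistic and p-value are single values, but if/else states the intent more directly.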

Something like this should work; I can't test it since I don't have the data.

library(dplyr)

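for (i in 9:ncol(AF)) {
  col_name <- names(AF)[i]
  print(col_name)
  print(AF %>%
          group_by(Gender) %>%
          # cur_data() holds only the current group's rows and omits the
          # grouping column, so index it by name rather than by position
          summarise(`W Statistic` = if (sd(cur_data()[[col_name]]) != 0)
                            shapiro.test(cur_data()[[col_name]])$statistic else NA,
                    `p-value` = if (sd(cur_data()[[col_name]]) != 0)
                            shapiro.test(cur_data()[[col_name]])$p.value else NA))
}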
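As a possible alternative to the loop, the same per-group test can be run over several columns at once with across(). This is only a sketch under assumptions: the toy data frame, the second column R_45, and the helper shapiro_or_na are made up for illustration, and the helper additionally returns NA when a group has fewer than 3 or more than 5000 non-missing values, the sample sizes that shapiro.test() accepts.

```r
library(dplyr)

# Hypothetical helper: returns the requested element of the test result,
# or NA when the test cannot be run on this group's values
shapiro_or_na <- function(x, what) {
  x <- x[!is.na(x)]
  if (length(x) < 3 || length(x) > 5000 || sd(x) == 0) return(NA_real_)
  unname(shapiro.test(x)[[what]])
}

# Made-up data standing in for AF; R_45 is constant for the F group
set.seed(1)
toy <- data.frame(
  Gender = factor(rep(c("F", "M"), each = 20)),
  R_44   = c(rexp(20), rnorm(20)),
  R_45   = c(rep(7, 20), rnorm(20))
)

# In the question this would be names(AF)[9:ncol(AF)]
vars <- c("R_44", "R_45")

result <- toy %>%
  group_by(Gender) %>%
  summarise(across(all_of(vars),
                   list(W = ~ shapiro_or_na(.x, "statistic"),
                        p = ~ shapiro_or_na(.x, "p.value"))))

print(result)
```

The result has one row per Gender and a pair of columns per variable (for example R_44_W and R_44_p), with NA where the values in a group are constant.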

Upvotes: 1
