Reputation: 85
I am trying to run a normality test (Shapiro-Wilk) on a dataset, and I want the statistic and the p-value for all columns simultaneously. I have read all the other pages on SO (R: Shapiro test by group won't produce p-values and corrupt data frame warning, Using shapiro.test on multiple columns in a data frame) concerning this, but still can't figure it out. Any help would be appreciated!
For example, here is the dataset, with one character vector (NVL) and the rest numeric; I want to group by NVL (NV/VL).
  NVL Var1 Var2 Var3 Var4 Var5
1  NV 22.5 26.8 89.2 35.7  100
2  NV 34.7 67.4 29.8 12.4  100
3  NV 68.3 34.5 44.5 23.8  100
4  NV 11.2 55.3 17.5 77.9  100
5  VL 55.6 77.2 59.7 89.6  100
6  VL 60.5 88.7 65.4 99.6  100
7  VL 89.4 87.5 65.9 89.5  100
8  VL 65.4 74.2 75.4 89.5  100
9  VL 81.8 78.5 95.4 92.5  100
Here is the code:
library(dplyr)
normalityVar1 <- mydata %>%
  group_by(NVL) %>%
  summarise(statistic = shapiro.test(Var1)$statistic,
            p.value = shapiro.test(Var1)$p.value)
Here is the output:
NVL statistic p.value
<chr> <dbl> <dbl>
1 VL 0.9125239 0.1985486
2 NV 0.8983501 0.2101248
Now, how do I edit this code so that I get this output for all the variables (Var2, 3, 4, 5) at the same time? I even tried aggregate and sapply, but I am stuck.
aggregate(formula = Var1 ~ NVL,
          data = mydata,
          FUN = function(x) { y <- shapiro.test(x); c(y$statistic, y$p.value) })
As you can see, I am able to do this for only one variable! I know I am close, but I just can't figure it out. Thanks in advance for any help!
Upvotes: 5
Views: 11958
Reputation: 7151
Based on the excellent answer by @GegznaV, here's an update with the newer construct tidyr::pivot_longer replacing tidyr::gather, and the nest-unnest syntax. In my experience, broom::glance gives the statistics of a test more reliably than broom::tidy, but you can try both.
library(tidyverse)
library(broom)
data %>%
  pivot_longer(cols = -NVL,
               names_to = "variable_name",
               values_to = "value") %>%
  nest(data = -c(variable_name, NVL)) %>%
  mutate(shapiro = map(data, ~ shapiro.test(.x$value)),
         glanced = map(shapiro, glance),
         tidied  = map(shapiro, tidy)) %>%
  unnest(glanced) %>%
  select(NVL, variable_name, W = statistic, p.value)
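If a wide layout closer to the one asked for in the question is preferred, the long result can be reshaped with tidyr::pivot_wider. A minimal sketch, where sw_results is a hypothetical name for the result of the pipeline above:
sw_results %>%
  pivot_wider(names_from = variable_name,
              values_from = c(W, p.value))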
Upvotes: 0
Reputation: 5580
I suggest constructing a long-format ("tidy") dataset and using the tidyverse functions.
Load packages:
library(dplyr) # for data manipulation functions
library(tidyr) # for data manipulation functions
library(data.table) # for function `fread`
library(broom) # for function `tidy`
Read data into R:
data <- fread(
"NVL Var1 Var2 Var3 Var4 Var5
NV 22.5 26.8 89.2 35.7 100
NV 34.7 67.4 29.8 12.4 100
NV 68.3 34.5 44.5 23.8 50
NV 11.2 55.3 17.5 77.9 100
VL 55.6 77.2 59.7 89.6 100
VL 60.5 88.7 65.4 99.6 100
VL 89.4 87.5 65.9 89.5 100
VL 65.4 74.2 75.4 89.5 90
VL 81.8 78.5 95.4 92.5 90")
Do the analysis:
# 1. Gather the variables whose values should be tested.
# 2. Group by the column of variable names (`variable_name`) and
#    by all grouping variables (in our case `NVL`).
# 3. Run the test on `value` and tidy the result.
# 4. Ungroup (it's good practice to do this).
# 5. Remove unnecessary information (column `method`).

sw_test_results <- data %>%
  gather(key = "variable_name", value = "value", Var1:Var5) %>%
  group_by(variable_name, NVL) %>%
  do(tidy(shapiro.test(.$value))) %>%
  ungroup() %>%
  select(-method)

sw_test_results
The results:
# A tibble: 10 x 4
variable_name NVL statistic p.value
<chr> <chr> <dbl> <dbl>
1 Var1 NV 0.931 0.602
2 Var1 VL 0.915 0.498
3 Var2 NV 0.941 0.660
4 Var2 VL 0.874 0.282
5 Var3 NV 0.910 0.480
6 Var3 VL 0.864 0.245
7 Var4 NV 0.900 0.433
8 Var4 VL 0.726 0.0176
9 Var5 NV 0.630 0.00124
10 Var5 VL 0.684 0.00647
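As a side note, do() is marked as superseded in more recent dplyr releases. A roughly equivalent sketch using group_modify() (assuming dplyr >= 0.8.1, where it was introduced):
sw_test_results2 <- data %>%
  gather(key = "variable_name", value = "value", Var1:Var5) %>%
  group_by(variable_name, NVL) %>%
  # group_modify() applies a function that returns a data frame to each group
  group_modify(~ tidy(shapiro.test(.x$value))) %>%
  ungroup() %>%
  select(-method)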
Upvotes: 4
Reputation: 24262
mydata <- read.table(text="
NVL Var1 Var2 Var3 Var4 Var5
1 NV 22.5 26.8 89.2 35.7 100
2 NV 34.7 67.4 29.8 12.4 100
3 NV 68.3 34.5 44.5 23.8 50
4 NV 11.2 55.3 17.5 77.9 100
5 VL 55.6 77.2 59.7 89.6 100
6 VL 60.5 88.7 65.4 99.6 100
7 VL 89.4 87.5 65.9 89.5 100
8 VL 65.4 74.2 75.4 89.5 90
9 VL 81.8 78.5 95.4 92.5 90
", header=T)
library(dplyr)

# Shapiro-Wilk test by group for a single column; the sd(x) != 0 check
# skips groups with constant values, for which shapiro.test() would error.
myfun <- function(x, group) {
  data.frame(x, group) %>%
    group_by(group) %>%
    summarise(statistic = ifelse(sd(x) != 0, shapiro.test(x)$statistic, NA),
              p.value   = ifelse(sd(x) != 0, shapiro.test(x)$p.value, NA))
}

(lst <- lapply(mydata[,-1], myfun, group=mydata[,1]))
The output is:
$Var1
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.9313476 0.6023421
2 VL 0.9149572 0.4979450
$Var2
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.9409576 0.6601747
2 VL 0.8736587 0.2815562
$Var3
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.9096322 0.4804557
2 VL 0.8644349 0.2446131
$Var4
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.9003135 0.43261822
2 VL 0.7260939 0.01760713
$Var5
# A tibble: 2 x 3
group statistic p.value
<fctr> <dbl> <dbl>
1 NV 0.6297763 0.001240726
2 VL 0.6840289 0.006470001
The lst output list can be transformed into a single data.frame object; the seq(...) index drops the duplicated group columns contributed by Var2 to Var5:
do.call(cbind, lst)[,-seq(4,3*(ncol(mydata)-1),3)]
Here is the output:
Var1.group Var1.statistic Var1.p.value Var2.statistic Var2.p.value Var3.statistic Var3.p.value Var4.statistic Var4.p.value Var5.statistic Var5.p.value
1 NV 0.9313476 0.6023421 0.9409576 0.6601747 0.9096322 0.4804557 0.9003135 0.43261822 0.6297763 0.001240726
2 VL 0.9149572 0.4979450 0.8736587 0.2815562 0.8644349 0.2446131 0.7260939 0.01760713 0.6840289 0.006470001
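Alternatively, the same list can be stacked into a long data frame (a sketch, assuming dplyr is loaded as above); bind_rows() records the list names in a new variable column:
bind_rows(lst, .id = "variable")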
Upvotes: 2
Reputation: 1711
Simply use summarise_all:
mydata <- read.table(text="
NVL Var1 Var2 Var3 Var4 Var5
1 NV 22.5 26.8 89.2 35.7 100
2 NV 34.7 67.4 29.8 12.4 100
3 NV 68.3 34.5 44.5 23.8 50
4 NV 11.2 55.3 17.5 77.9 100
5 VL 55.6 77.2 59.7 89.6 100
6 VL 60.5 88.7 65.4 99.6 100
7 VL 89.4 87.5 65.9 89.5 100
8 VL 65.4 74.2 75.4 89.5 90
9 VL 81.8 78.5 95.4 92.5 90
", header=T)
library(dplyr)
normalityVar1 <- mydata %>%
  group_by(NVL) %>%
  summarise_all(.funs = funs(statistic = shapiro.test(.)$statistic,
                             p.value = shapiro.test(.)$p.value))
With the desired output:
normalityVar1
# A tibble: 2 x 11
NVL Var1_statistic Var2_statistic Var3_statistic Var4_statistic Var5_statistic Var1_p.value Var2_p.value Var3_p.value
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NV 0.9313476 0.9409576 0.9096322 0.9003135 0.6297763 0.6023421 0.6601747 0.4804557
2 VL 0.9149572 0.8736587 0.8644349 0.7260939 0.6840289 0.4979450 0.2815562 0.2446131
# ... with 2 more variables: Var4_p.value <dbl>, Var5_p.value <dbl>
Note that you first get all the statistics and then all the p-values. Reordering the columns, if necessary, should be simple.
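Note that funs() has since been deprecated in dplyr. As a sketch of the same idea for dplyr >= 1.0.0 (an assumption about your package version), across() with a named list of functions gives an equivalent result:
mydata %>%
  group_by(NVL) %>%
  summarise(across(Var1:Var5,
                   list(statistic = ~ shapiro.test(.x)$statistic,
                        p.value   = ~ shapiro.test(.x)$p.value)))
With the default column naming, each variable's statistic and p-value come out next to each other, so no reordering is needed.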
Upvotes: 4