Shades
Shades

Reputation: 85

Shapiro.test in dplyr on multiple columns at same time

I am trying to run a normality test (shapiro-wilk) on a dataset, and I want the statistic and the p-value for all columns simultaneously. I have read all the other pages on SO (R: Shapiro test by group won't produce p-values and corrupt data frame warning, Using shapiro.test on multiple columns in a data frame) concerning this, but still cant figure it out. Any help would be appreciated!!

Foe eg, here is the dataset: with one character vector (NVL) and rest numeric and i want to group by NVL (NV/VL).

     NVL  Var1  Var2  Var3  Var 4  Var 5
1.   NV   22.5  26.8   89.2  35.7   100
2.   NV   34.7  67.4   29.8  12.4   100
3.   NV   68.3  34.5   44.5  23.8   100
4.   NV   11.2  55.3   17.5  77.9   100
5.   VL   55.6  77.2   59.7  89.6   100
6.   VL   60.5  88.7   65.4  99.6   100
7.   VL   89.4  87.5   65.9  89.5   100
8.   VL   65.4  74.2   75.4  89.5   100
9.   VL   81.8  78.5   95.4  92.5   100

Here is the code:

library(dplyr)
normalityVar1<-mydata %>%
group_by(NVL) %>%
summarise(statistic = shapiro.test(Var1)$statistic, 
p.value = shapiro.test(Var1)$p.value)

Here is the output:

NVL statistic   p.value
  <chr>     <dbl>     <dbl>
1    VL 0.9125239 0.1985486
2    NV 0.8983501 0.2101248

Now, do i edit this code, so that I can get this output for all the variables (Var2, 3, 4 ,5) at the same time? I even tried aggregate and sapply, but I am stuck.

aggregate(formula = Var1 ~ NVL,
data = mydata,
FUN = function(x) {y <- shapiro.test(x); c(y$statistic, y$p.value)}) 

As you can see, I am able to do this for only one variable! I know i am close, but i just cant figure it out anymore!! Thanks in advance for any help!!

Upvotes: 5

Views: 11958

Answers (4)

Agile Bean
Agile Bean

Reputation: 7151

Based on the excellent answer by @GegznaV, here's an update with the newer constructs tidyr::pivot_longer replacing tidyr::gather, and the nest-unnest syntax.

In my experience, broom::glance is more reliable to give the statistics of a test than broom::tidy but you can try both.

library(tidyverse)
library(broom)

data %>% 
  pivot_longer(
    cols = -NVL,
    names_to = "variable_name",
    values_to = "value"
  ) %>% 
  nest(data = -c(variable_name, NVL)) %>% 
  mutate(
    shapiro = map(data, ~shapiro.test(.x$value)),
    glanced = map(shapiro, glance),
    tidied = map(shapiro, tidy)
  ) %>% 
  unnest(glanced) %>% 
  select(NVL, variable_name, W = statistic, p.value) 

Upvotes: 0

GegznaV
GegznaV

Reputation: 5580

I suggest constructing a long-format ("tidy") dataset and using the tidiverse functions.

Load packages:

library(dplyr)      # for data manipulation functions
library(tidyr)      # for data manipulation functions
library(data.table) # for function `fread`
library(broom)      # for function `tidy`

Read data into R:

data <- fread(
"NVL   Var1  Var2   Var3  Var4   Var5
  NV   22.5  26.8   89.2  35.7   100
  NV   34.7  67.4   29.8  12.4   100
  NV   68.3  34.5   44.5  23.8   50
  NV   11.2  55.3   17.5  77.9   100
  VL   55.6  77.2   59.7  89.6   100
  VL   60.5  88.7   65.4  99.6   100
  VL   89.4  87.5   65.9  89.5   100
  VL   65.4  74.2   75.4  89.5   90
  VL   81.8  78.5   95.4  92.5   90")

Do the analysis:

# 1. Gather the variables that values should be tested.
# 2. Group by variable with variable names (`variable_name`) and 
#    by all group variables (in our case `NVL`).
# 3. Do the test for `value` and tidy the result.
# 4. Ungroup (it's a good practice to do this). 
# 5. Remove unnecessary information (column `method`).

sw_test_results <- data %>% 
    gather(key = "variable_name", value = "value", Var1:Var5) %>% 
    group_by(variable_name, NVL)  %>% 
    do(tidy(shapiro.test(.$value))) %>% 
    ungroup() %>% 
    select(-method)

sw_test_results

The results:

# A tibble: 10 x 4
   variable_name NVL   statistic p.value
   <chr>         <chr>     <dbl>   <dbl>
 1 Var1          NV        0.931 0.602  
 2 Var1          VL        0.915 0.498  
 3 Var2          NV        0.941 0.660  
 4 Var2          VL        0.874 0.282  
 5 Var3          NV        0.910 0.480  
 6 Var3          VL        0.864 0.245  
 7 Var4          NV        0.900 0.433  
 8 Var4          VL        0.726 0.0176 
 9 Var5          NV        0.630 0.00124
10 Var5          VL        0.684 0.00647

Upvotes: 4

Marco Sandri
Marco Sandri

Reputation: 24262

mydata <- read.table(text="
   NVL  Var1  Var2  Var3  Var4  Var5
1   NV   22.5  26.8   89.2  35.7   100
2   NV   34.7  67.4   29.8  12.4   100
3   NV   68.3  34.5   44.5  23.8   50
4   NV   11.2  55.3   17.5  77.9   100
5   VL   55.6  77.2   59.7  89.6   100
6   VL   60.5  88.7   65.4  99.6   100
7   VL   89.4  87.5   65.9  89.5   100
8   VL   65.4  74.2   75.4  89.5   90
9   VL   81.8  78.5   95.4  92.5   90
", header=T)

library(dplyr)
myfun <- function(x, group) {
  data.frame(x, group) %>%
  group_by(group) %>%
  summarise(
    statistic = ifelse(sd(x)!=0,shapiro.test(x)$statistic,NA), 
    p.value = ifelse(sd(x)!=0,shapiro.test(x)$p.value,NA)
  )
}
(lst <- lapply(mydata[,-1], myfun, group=mydata[,1]))

The output is:

$Var1
# A tibble: 2 x 3
   group statistic   p.value
  <fctr>     <dbl>     <dbl>
1     NV 0.9313476 0.6023421
2     VL 0.9149572 0.4979450

$Var2
# A tibble: 2 x 3
   group statistic   p.value
  <fctr>     <dbl>     <dbl>
1     NV 0.9409576 0.6601747
2     VL 0.8736587 0.2815562

$Var3
# A tibble: 2 x 3
   group statistic   p.value
  <fctr>     <dbl>     <dbl>
1     NV 0.9096322 0.4804557
2     VL 0.8644349 0.2446131

$Var4
# A tibble: 2 x 3
   group statistic    p.value
  <fctr>     <dbl>      <dbl>
1     NV 0.9003135 0.43261822
2     VL 0.7260939 0.01760713

$Var5
# A tibble: 2 x 3
   group statistic     p.value
  <fctr>     <dbl>       <dbl>
1     NV 0.6297763 0.001240726
2     VL 0.6840289 0.006470001

The lst output list can be transformed into a data.frame object:

do.call(cbind, lst)[,-seq(4,3*(ncol(mydata)-1),3)]

Here is the output:

  Var1.group Var1.statistic Var1.p.value Var2.statistic Var2.p.value Var3.statistic Var3.p.value Var4.statistic Var4.p.value Var5.statistic Var5.p.value
1         NV      0.9313476    0.6023421      0.9409576    0.6601747      0.9096322    0.4804557      0.9003135   0.43261822      0.6297763  0.001240726
2         VL      0.9149572    0.4979450      0.8736587    0.2815562      0.8644349    0.2446131      0.7260939   0.01760713      0.6840289  0.006470001

Upvotes: 2

wici
wici

Reputation: 1711

Simply use summarise_all:

mydata <- read.table(text="
   NVL  Var1  Var2  Var3  Var4  Var5
1   NV   22.5  26.8   89.2  35.7   100
2   NV   34.7  67.4   29.8  12.4   100
3   NV   68.3  34.5   44.5  23.8   50
4   NV   11.2  55.3   17.5  77.9   100
5   VL   55.6  77.2   59.7  89.6   100
6   VL   60.5  88.7   65.4  99.6   100
7   VL   89.4  87.5   65.9  89.5   100
8   VL   65.4  74.2   75.4  89.5   90
9   VL   81.8  78.5   95.4  92.5   90
", header=T)


library(dplyr)
normalityVar1<-mydata %>%
  group_by(NVL) %>%
  summarise_all(.funs = funs(statistic = shapiro.test(.)$statistic, 
                             p.value = shapiro.test(.)$p.value))

With the desired output:

normalityVar1
# A tibble: 2 x 11
    NVL Var1_statistic Var2_statistic Var3_statistic Var4_statistic Var5_statistic Var1_p.value Var2_p.value Var3_p.value
  <fctr>          <dbl>          <dbl>          <dbl>          <dbl>          <dbl>        <dbl>        <dbl>        <dbl>
1     NV      0.9313476      0.9409576      0.9096322      0.9003135      0.6297763    0.6023421    0.6601747    0.4804557
2     VL      0.9149572      0.8736587      0.8644349      0.7260939      0.6840289    0.4979450    0.2815562    0.2446131
# ... with 2 more variables: Var4_p.value <dbl>, Var5_p.value <dbl>

Note that you first have all statistics and then all p-values. Reordering of the columns - if necessary - should be simple.

Upvotes: 4

Related Questions