Result of `na.omit()` not behaving the same as pre-cleaned dataset when using `lapply()`

Question

I am trying to run wilcox.test() on several subsets of data using the lapply() function. The data is grouped in the first column of my data frame by a text identifier (site name), and there are two other columns for data from 2013 and 2017 on which I'm running the wilcox test. About 10% of my 500 rows of data have a missing value in either the 2013 or 2017 (or both) columns.

When I try to run the lapply() function shown below, I get the error:

df<-read.csv("myfile.csv",header=T)

split.df<-split(df,df$Site)
lapply(split.df, function(g) wilcox.test(g$2013, g$2017, paired=T))

Error in wilcox.test.default(g$2013, g$2017, Paired = T) : not enough (finite) 'x' observations

I have tried removing the NAs from the data frame using the na.omit() and na.exclude() functions:

df<-na.omit(df)

OR

df<-na.exclude(df)

When running the same split, followed by lapply as written above after either omitting or excluding the NA's, I get the same error.

If I clean the data up in Excel before importing, by removing all rows with a missing value in either the 2013 or 2017 column, and then import the data, the lapply() function runs correctly.

I am using RStudio, and I have looked at the data frame at each step. After importing the raw data I have 500 observations. After using either na.omit() or na.exclude(), the data frame still shows 500 rows, but the rows which had NA values are 'masked' in that their row numbers are skipped. For example, if rows 5, 8 and 10 had NAs in them, the cleaned data frame would show rows 1, 2, 3, 4, 6, 7, 9, 11... and so on. If I directly compare row 12 of the cleaned dataset and the raw dataset, they have the same values (which is why I think na.omit() and na.exclude() are simply hiding or masking the rows with NAs).

When I import the data frame after cleaning it up in Excel first, I see that there are truly only 450 rows. I think the error from the lapply() function may be because na.omit() and na.exclude() aren't actually removing those rows from the data frame.
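One way to check this suspicion is to compare nrow() and rownames() before and after cleaning. The sketch below uses a small made-up data frame (the values and the names df and cleaned are hypothetical, not taken from the linked file). As far as I understand it, na.omit() does drop the rows and simply keeps the original row labels, so skipped row numbers in the viewer would not by themselves mean that the rows are still present.

```r
# Toy data frame (hypothetical values), with NAs in rows 2 and 3
df <- data.frame(
  Site  = c("A", "A", "A", "B", "B", "B"),
  y2013 = c(1.2, NA, 3.1, 4.0, 5.5, 6.1),
  y2017 = c(2.0, 2.5, NA, 4.4, 5.0, 7.2)
)

cleaned <- na.omit(df)

nrow(df)           # number of rows before cleaning
nrow(cleaned)      # number of rows after cleaning
rownames(cleaned)  # surviving rows keep their original labels

# The same rows survive when subsetting with complete.cases()
identical(rownames(cleaned), rownames(df[complete.cases(df), ]))
```

If nrow() reports fewer rows after cleaning, the rows are gone and the gaps in the row numbers are only labels; rownames(cleaned) <- NULL would renumber them.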

Is there a function to truly delete the rows with NA's, or am I totally on the wrong path here? Any tips are appreciated.

Edit:

Example data here: https://1drv.ms/u/s!Av1rL-HNLDNsgZ84P86y953iCXxnjA

Example code, which gives the aforementioned error:

names(df)
df.split<-split(df, df$Site)
df.split
lapply(df.split, function(g) wilcox.test(g$y2013, g$y2017, paired=T))

If the linked csv file is cleaned up manually by removing rows with missing values, the above code works correctly, with the following output:

$D03

Wilcoxon signed rank test

data: g$y2013 and g$y2017
V = 220, p-value = 0.01681
alternative hypothesis: true location shift is not equal to 0

$D04

Wilcoxon signed rank test

data: g$y2013 and g$y2017
V = 158, p-value = 0.0008411
alternative hypothesis: true location shift is not equal to 0

$D08

Wilcoxon signed rank test

data: g$y2013 and g$y2017
V = 96, p-value = 1.146e-05
alternative hypothesis: true location shift is not equal to 0

$D09

Wilcoxon signed rank test

data: g$y2013 and g$y2017
V = 44, p-value = 0.0002089
alternative hypothesis: true location shift is not equal to 0

$D11

Wilcoxon signed rank test

data: g$y2013 and g$y2017
V = 153, p-value = 0.0006289
alternative hypothesis: true location shift is not equal to 0

$Platform1

Wilcoxon signed rank test

data: g$y2013 and g$y2017
V = 285, p-value = 0.05974
alternative hypothesis: true location shift is not equal to 0

$Platform2

Wilcoxon signed rank test

data: g$y2013 and g$y2017
V = 43, p-value = 0.002726
alternative hypothesis: true location shift is not equal to 0

$Platform3

Wilcoxon signed rank test

data: g$y2013 and g$y2017
V = 127, p-value = 0.002817
alternative hypothesis: true location shift is not equal to 0

Ronak Shah · Accepted Answer

Use complete.cases() to drop the rows that contain NA values, then split the data and apply the test:

df1 <- df[complete.cases(df), ]
df.split <- split(df1, df1$Site)

lapply(df.split, function(g) wilcox.test(g$y2013, g$y2017, paired=TRUE))

#$D03

#   Wilcoxon signed rank test

#data:  g$y2013 and g$y2017
#V = 220, p-value = 0.01681
#alternative hypothesis: true location shift is not equal to 0


#$D04

#   Wilcoxon signed rank test

#data:  g$y2013 and g$y2017
#V = 158, p-value = 0.0008411
#alternative hypothesis: true location shift is not equal to 0

#...
#...
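A more defensive variant, sketched on made-up data, may help if the error persists after the NA rows are gone. It rests on two assumptions not confirmed by the question: that Site is stored as a factor, and that at least one site loses all of its rows once incomplete pairs are dropped. In that situation split() still returns an empty group for the unused level, and an empty group is one thing that can trigger "not enough (finite) 'x' observations". The data values and the names keep and results are hypothetical.

```r
# Toy data (hypothetical values); site "C" has no complete pairs
df <- data.frame(
  Site  = factor(c(rep("A", 6), rep("B", 6), rep("C", 2))),
  y2013 = c(1.1, 2.3, 3.2, 4.8, 5.1, 6.7,
            2.0, 3.5, 4.1, 5.9, 6.2, 7.4,
            NA, NA),
  y2017 = c(1.9, 2.9, 4.4, 5.0, 6.6, 7.1,
            2.6, 4.6, 4.9, 6.3, 7.7, 8.3,
            3.0, 4.0)
)

# Keep rows that are complete in the two tested columns only,
# so NAs in unrelated columns do not discard usable pairs
keep <- complete.cases(df[, c("y2013", "y2017")])
df1 <- df[keep, ]

# drop = TRUE discards factor levels that no longer have any rows
df.split <- split(df1, df1$Site, drop = TRUE)

# tryCatch() returns the error message for a failing group
# instead of aborting the whole lapply() call
results <- lapply(df.split, function(g) {
  tryCatch(
    wilcox.test(g$y2013, g$y2017, paired = TRUE),
    error = function(e) conditionMessage(e)
  )
})

names(results)
```

With drop = TRUE the emptied site is left out of the result rather than raising an error; any group that still fails shows up as a character string in results, which makes it easy to see which sites have too few pairs.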
