jwin
jwin

Reputation: 41

Determine percentage of rows with missing values in a dataframe in R

I have a data frame with three variables and some missing values in one of the variables that looks like this:

subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)

df1 <- data.frame(subject,part,sad)

I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:

columns<-c("sad.m",
           "part", 
           "subject")

df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns

tn<-unique(df1$subject)

row=1

for (s in tn){
  for (i in 0:3){
    TN<-df1[df1$subject==s&df1$part==i,]
    df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
    df2[row,"part"]<-i 
    df2[row,"subject"]<-s 
    row=row+1
  }
  
}

Now I want to include an additional variable 'missing' in that indicates the percentage of rows per subject and part with missing values, so that I get df3:

subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)

df3 <- data.frame(subject,part,sad.m,missing)

I'd really appreciate any help on how to go about this!

Upvotes: 1

Views: 1289

Answers (3)

user438383
user438383

Reputation: 6206

It's best to try and avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, dplyr library is perfect and well worth learning. It can save you a lot of time.

You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:

df2 = df1 %>% 
    dplyr::group_by(subject, part) %>%
    dplyr::summarise(
        sad_mean = mean(na.omit(sad)),
        na_count = (sum(is.na(sad) / n()) * 100)
    )
df2
# A tibble: 8 x 4
# Groups:   subject [2]
  subject  part sad_mean na_count
    <dbl> <dbl>    <dbl>    <dbl>
1       1     0     4.75        0
2       1     1     2          50
3       1     2     2.5        50
4       1     3     1.67       25
5       2     0     5.5        50
6       2     1     4.5        50
7       2     2     4          50
8       2     3     4          25

Upvotes: 4

Ronak Shah
Ronak Shah

Reputation: 388807

For each subject and part you can calculate mean of sad and calculate ratio of NA value using is.na and mean.

library(dplyr)
df1 %>%
  group_by(subject, part) %>%
  summarise(sad.m = mean(sad, na.rm = TRUE), 
            perc_missing = mean(is.na(sad)) * 100)

#  subject  part sad.m perc_missing
#    <dbl> <dbl> <dbl>        <dbl>
#1       1     0  4.75            0
#2       1     1  2              50
#3       1     2  2.5            50
#4       1     3  1.67           25
#5       2     0  5.5            50
#6       2     1  4.5            50
#7       2     2  4              50
#8       2     3  4              25

Same logic with data.table :

library(data.table)

setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE), 
               perc_missing = mean(is.na(sad)) * 100), .(subject, part)]

Upvotes: 2

Duck
Duck

Reputation: 39585

Try this dplyr approach to compute df3:

library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))

Output:

# A tibble: 8 x 3
# Groups:   subject [2]
  subject  part     N
    <dbl> <dbl> <dbl>
1       1     0     0
2       1     1    50
3       1     2    50
4       1     3    25
5       2     0    50
6       2     1    50
7       2     2    50
8       2     3    25

And for full interaction with df2 you can use left_join():

#Left join
df3 <- df1 %>% group_by(subject,part) %>%
  summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
  left_join(df2)

Output:

# A tibble: 8 x 4
# Groups:   subject [2]
  subject  part     N sad.m
    <dbl> <dbl> <dbl> <dbl>
1       1     0     0  4.75
2       1     1    50  2   
3       1     2    50  2.5 
4       1     3    25  1.67
5       2     0    50  5.5 
6       2     1    50  4.5 
7       2     2    50  4   
8       2     3    25  4   

Upvotes: 1

Related Questions