johnnewbie25

Reputation: 149

Trying to count the rows in a data.frame after removing NA's using na.omit()

I'm new to programming and trying to count the number of rows in a file after removing the NA values. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

Here is my code:

complete <- function(directory, id = 1:332){
  setwd(directory)

  df <- data.frame()
  for (i in seq_along(id)){
    if (id[i] < 10){
      file_name <- paste("00",id[i],".csv", sep = "")
    }
    else if (id[i] >= 10 & id[i] < 100){
      file_name <- paste("0",id[i],".csv", sep = "")
    }
    else{
      file_name <- paste(id[i],".csv", sep = "")
    }
    file <- read.csv(as.character(file_name))
    newfile <- na.omit(file)
    #print(newfile)

    df <- data.frame(id = id, nobs = nrow(newfile))

  }

  print(df)

}

When I pass in a vector of 1:3 like so: complete("specdata", 1:3) I'm getting the following output:

id    nobs
1     243
2     243
3     243

Here id is the file number (the files are numbered 1 to 332) and nobs is the number of complete cases.

It seems to be taking the last count and repeating it for every id, and I don't know how to fix it. As a beginner, I get tripped up on programming logic like this. I also saw a few other solutions to this problem, but they used complete.cases, which I didn't understand how to apply. Each id in the data frame should have its own count of complete cases (the nobs column in the data frame).

Upvotes: 1

Views: 1071

Answers (2)

Pierre L

Reputation: 28441

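Here is a shortened version to study from. Notice that I do not have to paste the leading zeroes explicitly: list.files() returns the files sorted by name, so the x-th file is the one for id x (this assumes the directory contains only those CSV files). Try list.files(path="specdata", full.names=TRUE) on its own to see what that function does: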

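complete <- function(directory, id = 1:332) {
  # List the CSV files once; they come back sorted by name,
  # so position x corresponds to file id x (001.csv, 002.csv, ...)
  files <- list.files(path = directory, full.names = TRUE, pattern = "\\.csv$")
  nobs <- sapply(id, function(x) {
    df <- read.csv(files[x])
    sum(complete.cases(df))  # count the rows that contain no NA
  })
  data.frame(id, nobs)
}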
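The positional lookup above only works while the x-th entry of list.files() really is the file for id x. A minimal sketch of an alternative that builds each file name from the id with sprintf(), so it does not depend on listing order. The name complete_by_name and the small demo files written to a temporary directory are made up for illustration and are not the real specdata files:

```r
# Hypothetical variant: build "001.csv", "002.csv", ... directly from the ids
complete_by_name <- function(directory, id = 1:332) {
  files <- file.path(directory, sprintf("%03d.csv", id))
  nobs <- vapply(files, function(f) sum(complete.cases(read.csv(f))),
                 integer(1), USE.NAMES = FALSE)
  data.frame(id = id, nobs = nobs)
}

# Demo on made-up data in a temporary directory
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(a = c(1, NA, 3), b = c(1, 2, 3)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(a = c(1, 2, 3, 4), b = c(NA, 2, 3, 4)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

result <- complete_by_name(demo_dir, 1:2)
print(result)
```

With these demo files, the first has 2 complete rows and the second has 3, so each id gets its own count.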

edit

The difference between na.omit and complete.cases is:

#Example
#Create data.frame with an NA value
df <- head(iris,3)
df[1,1] <- NA
df
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1           NA         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa


#'na.omit' will return a data.frame with non-NA rows:
na.omit(df)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa

#'complete.cases' gives TRUE for each row without NAs and FALSE for each row with an NA
complete.cases(df)
#[1] FALSE  TRUE  TRUE

I use 'complete.cases' because I just want the total count of non-NA rows. I don't need the data.frame itself, which is what 'na.omit' gives.

I can add up the TRUEs and FALSEs to get the total with sum(complete.cases(df)). When summing a logical vector, R treats each TRUE as 1 and each FALSE as 0.
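As a quick check, here is a small sketch showing that both approaches give the same count on the example data frame; the variable names n_complete and n_omit are just for illustration:

```r
df <- head(iris, 3)
df[1, 1] <- NA

n_complete <- sum(complete.cases(df))  # TRUE counts as 1, FALSE as 0
n_omit <- nrow(na.omit(df))            # rows left after dropping the NA row

n_complete == n_omit
# [1] TRUE
```

So nrow(na.omit(df)) from the question and sum(complete.cases(df)) are interchangeable for counting; the second just skips building the intermediate data.frame.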

Upvotes: 1

Carlos Alberto

Reputation: 698

You need to make two changes to your code. First, define the full data.frame up front, before your loop:

df <- data.frame(id = id, nobs = NA)

Second, after you create newfile, replace your df <- data.frame... instruction with:

df[i,2] <- nrow(newfile)
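The reason the original output repeats 243 is that df <- data.frame(id = id, nobs = nrow(newfile)) rebuilds the whole data frame on every pass: the single nrow() value is recycled across all ids, and only the last iteration's result survives. A rough sketch with both changes applied follows. The name complete_fixed and the demo files are hypothetical, and sprintf() with file.path() stand in for the if/else name-building and setwd() only to keep the example short:

```r
complete_fixed <- function(directory, id = 1:332) {
  # Change 1: create the full data frame once, before the loop
  df <- data.frame(id = id, nobs = NA)
  for (i in seq_along(id)) {
    file_name <- file.path(directory, sprintf("%03d.csv", id[i]))
    newfile <- na.omit(read.csv(file_name))
    # Change 2: fill in only row i instead of rebuilding df
    df[i, 2] <- nrow(newfile)
  }
  df
}

# Demo on made-up data in a temporary directory (not the real specdata files)
loop_dir <- file.path(tempdir(), "specdata_loop_demo")
dir.create(loop_dir, showWarnings = FALSE)
write.csv(data.frame(a = c(1, NA, 3), b = c(1, 2, 3)),
          file.path(loop_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(a = c(1, 2, 3, 4), b = c(NA, 2, 3, 4)),
          file.path(loop_dir, "002.csv"), row.names = FALSE)

fixed <- complete_fixed(loop_dir, 1:2)
print(fixed)
```

Returning df rather than calling print(df) lets the caller store the result; it still prints when the function is called at the console.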

Upvotes: 0
