user4326875


Pulling data from 332 .csv files and returning the number of observed cases for each variable in the file

I am writing an R function that reads a directory full of 332 .csv files and reports the number of completely observed cases in each data file. The function returns a data frame where the first column is the name of the file and the second column is the number of complete cases. For example:

ID  OBS
1   233
2   149
etc.

Here is the code I wrote:

complete <- function(directory, id = 1:332) {
    files_full <- list.files(directory, full.names = TRUE)
    nobs <- sum(complete.cases(files_full[id]))
    data <- data.frame(id, nobs)
    return(data)

}

The problem here is that, while the function does run, it gives me a value of 1 for each "nobs" in my column.

Upvotes: 2

Views: 7494

Answers (6)

Khalil

Reputation: 1

I think this is simpler and easier to understand:

    complete <- function(dir, id = 1:332) {
        # full paths of all files in the directory, in the order list.files() returns them
        files <- list.files(dir, full.names = TRUE)
        nobs <- integer(0)

        for (i in id) {
            # number of rows without missing values in the i-th file
            ok <- sum(complete.cases(read.csv(files[i])))
            nobs <- c(nobs, ok)
        }
        data.frame(id = id, nobs = nobs)
    }

Upvotes: 0

Richhard

Reputation: 11

This was my solution, which seems easier to read:

complete <- function(directory, id = 1:332) {
    # build file names from zero-padded ids, e.g. 1 -> "001.csv"
    filenames <- sprintf("%03d.csv", id)
    filePaths <- file.path(directory, filenames)
    nFiles <- length(id)
    output <- matrix(ncol = 2, nrow = nFiles)
    for (i in seq_len(nFiles)) {
        output[i, ] <- c(id[i], sum(complete.cases(read.csv(filePaths[i]))))
    }
    output <- setNames(data.frame(output), c("id", "nobs"))
    output
}

Hope this helps someone.

Upvotes: 1

Jorge

Reputation: 358

complete <- function(directory, id = 1:332) {
  x <- list.files(directory)
  # match each id to the file whose name (minus ".csv") is that number
  y <- x[match(id, as.numeric(sub("\\.csv$", "", x)))]
  z <- file.path(directory, y)
  a <- function(f) sum(complete.cases(read.csv(f)))
  data.frame(id = id, nobs = unlist(lapply(z, a)))
}

complete("specdata",4:14)

Edit:
This code matches each id against the numeric part of the file name, instead of indexing the list of files by position. A nested function reads one file and counts its complete cases. Inside data.frame(), lapply() applies that function to the vector of matched file paths, and unlist() flattens the result into a plain vector of counts.
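
To illustrate the difference between matching by name and indexing by position, here is a minimal sketch. The temporary directory and the unpadded file names (2.csv, 10.csv, 100.csv) are assumptions made up for the example; with zero-padded names such as 001.csv the two approaches would agree.

```r
# Throwaway directory with three hypothetical files: 2.csv, 10.csv, 100.csv
directory <- tempfile("specdata_")
dir.create(directory)
for (n in c(2, 10, 100)) {
  write.csv(data.frame(x = n), file.path(directory, paste0(n, ".csv")),
            row.names = FALSE)
}

x <- list.files(directory)                     # sorted as text, not by number
file_ids <- as.numeric(sub("\\.csv$", "", x))  # numeric part of each file name

by_position <- x[2]               # second entry of the listing, which is not 2.csv
by_name <- x[match(2, file_ids)]  # the file whose number is 2

print(x)
print(by_position)
print(by_name)
```

Because "2.csv" sorts after the names starting with "1", positional indexing picks a different file than the one whose number is 2.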

Upvotes: 2

baptiste

Reputation: 77096

Let's go through what your code is actually doing:

complete <- function(directory, id = 1:332) {
    # list files
    files_full <- list.files(directory, full.names = TRUE)
    # create an empty placeholder, to grow sequentially. Known in some circles as R Inferno 
    # http://www.burns-stat.com/documents/books/the-r-inferno/
    dat <- data.frame()
    for (i in id) { # select filenames based on their position in the list 
                    # (prone to errors, because it depends on the order)
            dat <- rbind(dat, read.csv(files_full[i])) # read the data, and append it 
                                                       # to previous data.frame. Why??
            nobs <- sum(complete.cases(files_full[i])) # number of complete cases...
                                                       # in a character vector of length 1
            data <- data.frame(id, nobs)               # this gets overwritten every time
    }
    data
}

Below is what you probably meant to write:

complete <- function(directory, id = 1:332) {
    # list files
    files_full <- list.files(directory, full.names = TRUE)
    files_toread <- files_full[id] # filter out unwanted files (tip: ?grep is better)
    output <- data.frame(id = id, nobs = 0)
    for (i in seq_along(id)) { # index positions in files_toread, not the id values
            tmp <- read.csv(files_toread[i]) # read the data
            nobs <- sum(complete.cases(tmp)) # number of complete cases
            output[i, "nobs"] <- nobs
    }
    output
}

Upvotes: 3

zero323

Reputation: 330063

A slightly different approach:

complete <- function(directory, pattern = "csv$") {
    fnames <- list.files(directory, pattern = pattern, full.names = TRUE)
    count_complete <- function(fname) sum(complete.cases(read.csv(fname)))
    # vapply returns a plain integer vector, so both columns are atomic
    data.frame(
        file = fnames,
        complete = vapply(fnames, count_complete, integer(1), USE.NAMES = FALSE)
    )
}

If you want to keep id as an argument:

complete <- function(directory, id = 1:332) {
    count_complete <- function(fname) sum(complete.cases(read.csv(fname)))
    fnames <- list.files(directory, full.names=TRUE)[id]
    data.frame(id = id, complete = unlist(lapply(fnames, count_complete)))
}

Upvotes: 4

baptiste

Reputation: 77096

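sum(complete.cases(files_full[i])) doesn't make much sense: files_full[i] is a file name (a character vector of length 1), not the data in that file, so the result is always 1. That's probably where you went wrong.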
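
A quick way to see this, as a minimal sketch: the path below is hypothetical and is never opened, and the small data frame stands in for what read.csv() would return.

```r
# Hypothetical path; complete.cases() looks at the string, not at the file
path <- "specdata/001.csv"

# One non-missing string counts as one complete case
n_from_name <- sum(complete.cases(path))

# The count only becomes meaningful once the data has been read.
# This data frame is a stand-in for read.csv(path).
d <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))
n_complete <- sum(complete.cases(d))  # rows 1 and 4 have no NA

print(n_from_name)
print(n_complete)
```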

I'd do it like this,

1- define a function to treat a single dataset,

read_and_summarise <- function(f, ...) {d <- read.csv(f, ...) ; sum(complete.cases(d))}

2- apply this function to all files,

lf <- list.files(directory, full.names = TRUE)
vapply(lf, read_and_summarise, 0L)

(untested)
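
For completeness, a sketch of how these two steps might be wrapped into the function the question asks for. The wrapper, the zero-padded file naming (001.csv, 002.csv), the temporary directory, and the sample data are all assumptions for illustration; read_and_summarise is repeated so the block runs on its own.

```r
read_and_summarise <- function(f, ...) {
  d <- read.csv(f, ...)
  sum(complete.cases(d))
}

complete <- function(directory, id = 1:332) {
  # Assumes files are named by zero-padded id, e.g. 001.csv (hypothetical layout)
  lf <- file.path(directory, sprintf("%03d.csv", id))
  data.frame(id = id,
             nobs = vapply(lf, read_and_summarise, 0L, USE.NAMES = FALSE))
}

# Two small sample files in a temporary directory
directory <- tempfile("specdata_")
dir.create(directory)
write.csv(data.frame(a = c(1, NA, 3), b = c(1, 2, 3)),
          file.path(directory, "001.csv"), row.names = FALSE)
write.csv(data.frame(a = c(NA, NA), b = c(1, 2)),
          file.path(directory, "002.csv"), row.names = FALSE)

result <- complete(directory, id = 1:2)
print(result)
```

The 0L passed to vapply() declares that each call must return a single integer, so a file that fails to produce one raises an error instead of silently changing the result type.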

Upvotes: 3
