I am writing an R function that reads a directory full of 332 .csv files and reports the number of completely observed cases in each data file. The function returns a data frame where the first column is the name of the file and the second column is the number of complete cases. For example:
ID OBS
1 233
2 149
etc.
Here is the code I wrote:
complete <- function(directory, id = 1:332) {
files_full <- list.files(directory, full.names = TRUE)
nobs <- sum(complete.cases(files_full[id]))
data <- data.frame(id, nobs)
return(data)
}
The problem here is that, while the function does run, it gives me a value of 1 for each "nobs" in my column.
Upvotes: 2
Views: 7494
Reputation: 1
I think this is simpler and easier to understand:
complete <- function(directory, id = 1:332){
  files <- list.files(directory, full.names = TRUE)
  nobs <- integer(length(id)) # preallocate instead of growing with rbind()
  for(k in seq_along(id)){
    # files are picked by position, so this relies on the sorted file order
    nobs[k] <- sum(complete.cases(read.csv(files[id[k]])))
  }
  data.frame(id = id, nobs = nobs)
}
Upvotes: 0
Reputation: 11
This was my solution which seems easier to read:
complete <- function(directory,id=1:332){
filenames <- sprintf("%03d.csv", id)
filePaths <- paste(directory, filenames, sep="/")
nFiles=length(id)
output <- matrix(ncol=2, nrow=nFiles)
for(i in seq_len(nFiles)){ # seq_len() also copes with an empty id
output[i,]= c(id[i],sum(complete.cases(read.csv(filePaths[i]))))
}
output <- setNames(data.frame(output),c("id","nobs"))
output
}
Hope this helps someone.
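To see what the first two lines of the function produce, here is a small sketch. The ids are arbitrary example values, and "specdata" is just an assumed directory name:

```r
id <- c(1, 23, 332)                    # example ids, chosen arbitrarily
filenames <- sprintf("%03d.csv", id)   # zero-pad each id to three digits
filePaths <- paste("specdata", filenames, sep = "/")
print(filePaths)
# [1] "specdata/001.csv" "specdata/023.csv" "specdata/332.csv"
```

Because the paths are built from the ids themselves, the result does not depend on the order in which the directory happens to be listed.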
Upvotes: 1
Reputation: 358
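complete <- function(directory, id = 1:332) {
  # list only the .csv files in the directory
  x <- list.files(directory, pattern = "\\.csv$")
  # match each id to the file whose numeric name equals it (e.g. 4 -> "004.csv")
  y <- x[match(id, as.numeric(sub("\\.csv$", "", x)))]
  z <- file.path(directory, y)
  # count the complete rows in a single file
  a <- function(f) sum(complete.cases(read.csv(f)))
  data.frame(id = id, nobs = unlist(lapply(z, a)))
}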
complete("specdata",4:14)
Edit:
This code matches each id against the file names instead of indexing the file list by position. A nested helper function reads one file and counts its complete cases. Inside data.frame(), lapply() applies that helper to every matched file path, and unlist() flattens the resulting list into a plain vector of counts.
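A minimal sketch of the matching step on its own, using a made-up file listing that is deliberately out of numeric order (the file names and ids here are hypothetical):

```r
# Hypothetical directory listing, not in numeric order
x <- c("010.csv", "002.csv", "001.csv")
id <- c(1, 10)

# Strip the extension and convert to numbers: 10, 2, 1
file_ids <- as.numeric(sub("\\.csv$", "", x))

# For each id, match() gives the position of the file carrying that number
matched <- x[match(id, file_ids)]
print(matched)
# [1] "001.csv" "010.csv"
```

Indexing by position (x[id]) would have failed here, since x has only three elements and they are not sorted.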
Upvotes: 2
Reputation: 77096
Let's go through what your code is actually doing:
complete <- function(directory, id = 1:332) {
# list files
files_full <- list.files(directory, full.names = TRUE)
# create an empty placeholder, to grow sequentially. Known in some circles as R Inferno
# http://www.burns-stat.com/documents/books/the-r-inferno/
dat <- data.frame()
for (i in id) { # select filenames based on their position in the list
# (prone to errors, because it depends on the order)
dat <- rbind(dat, read.csv(files_full[i])) # read the data, and append it
# to previous data.frame. Why??
nobs <- sum(complete.cases(files_full[i])) # number of complete cases...
# in a character vector of length 1
data <- data.frame(id, nobs) # this gets overwritten every time
}
data
}
Here is what you probably meant to write:
complete <- function(directory, id = 1:332) {
  # list files
  files_full <- list.files(directory, full.names = TRUE)
  files_toread <- files_full[id] # filter out unwanted files (tip: ?grep is better)
  output <- data.frame(id = id, nobs = 0)
  for (i in seq_along(id)) { # index by position within id, not by the id value
    tmp <- read.csv(files_toread[i]) # read the data
    nobs <- sum(complete.cases(tmp)) # number of complete cases
    output[i, "nobs"] <- nobs
  }
  output
}
Upvotes: 3
Reputation: 330063
A slightly different approach:
complete <- function(directory, pattern = "csv$") {
  fnames <- list.files(directory, pattern = pattern, full.names = TRUE)
  count_complete <- function(fname) sum(complete.cases(read.csv(fname)))
  # vapply() returns a plain integer vector, so the columns are not list columns
  data.frame(
    file = fnames,
    complete = vapply(fnames, count_complete, integer(1), USE.NAMES = FALSE)
  )
}
If you want to keep id as an argument:
complete <- function(directory, id = 1:332) {
count_complete <- function(fname) sum(complete.cases(read.csv(fname)))
fnames <- list.files(directory, full.names=TRUE)[id]
data.frame(id = id, complete = unlist(lapply(fnames, count_complete)))
}
Upvotes: 4
Reputation: 77096
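sum(complete.cases(files_full[i])) doesn't make much sense, and it's probably where you went wrong: files_full[i] is a file name (a character vector of length 1), not the data in that file, so complete.cases() returns a single TRUE and the sum is always 1.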
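A quick way to convince yourself, as a sketch: the path below is only a string and no file is read, and the small data frame is made-up data for comparison.

```r
path <- "specdata/001.csv"            # just a character string, nothing is read
n_from_path <- sum(complete.cases(path))
print(n_from_path)                    # 1, whatever the file contains

# Made-up data frame: the second row has a missing value
d <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"))
n_from_data <- sum(complete.cases(d))
print(n_from_data)                    # 2 complete rows
```

complete.cases() has to be given the data frame returned by read.csv(), not the file name.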
I'd do it like this,
1- define a function to treat a single dataset,
read_and_summarise <- function(f, ...) {d <- read.csv(f, ...) ; sum(complete.cases(d))}
2- apply this function to all files,
lf <- list.files(directory, full.names = TRUE)
vapply(lf, read_and_summarise, 0L)
(untested)
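For what it's worth, here is a self-contained sketch of those two steps that can be run as is. The temporary directory, the two tiny CSV files and the "specdata_demo" name are all invented for the demonstration, and I have added a pattern argument and USE.NAMES = FALSE, which are not in the snippet above:

```r
read_and_summarise <- function(f, ...) {
  d <- read.csv(f, ...)
  sum(complete.cases(d))
}

# Invented demo data in a fresh temporary directory
directory <- file.path(tempdir(), "specdata_demo")
unlink(directory, recursive = TRUE)
dir.create(directory)
write.csv(data.frame(x = c(1, NA, 3), y = c(4, 5, 6)),
          file.path(directory, "001.csv"), row.names = FALSE)
write.csv(data.frame(x = c(NA, 2), y = c(1, NA)),
          file.path(directory, "002.csv"), row.names = FALSE)

# Apply the single-file function to every .csv file
lf <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
counts <- vapply(lf, read_and_summarise, integer(1), USE.NAMES = FALSE)

result <- data.frame(file = basename(lf), nobs = counts)
print(result)
```

vapply() checks that every call returns exactly one integer, so a file that fails to produce a count shows up as an error rather than as a silently malformed result.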
Upvotes: 3