Reputation: 279
I have to write a function that reads a directory full of files and reports the number of completely observed cases in each data file (that is, rows with no NA values in any column). The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases. Please see below for my draft; I hope the comments help!
complete <- function (directory, id = 1:332){
  nobs = numeric() #currently blank
  # nobs is the number of complete cases in each file
  data = data.frame() #currently blank dataframe
  for (i in id){
    #get the right filepath
    newread = read.csv(paste(directory,"/",formatC(i,width=3,flag="0"),".csv",sep=""))
    my_na <- is.na(newread) #let my_na be the logic vector of true and false na values
    nobs = sum(!my_na) #sum up all the not na values (1 is not na, 0 is na, due to inversion).
    #this returns # of true values
    #add on to the existing dataframe
    data = c(data, i, nobs, row.names=i)
  }
  data # return the updated data frame for the specified id range
}
The output of a sample run, complete("specdata",1), is:
[[1]]
[1] 1
[[2]]
[1] 3161
$row.names
[1] 1
I am not sure why it is not displaying in the regular dataframe format. Also I am pretty sure my numbers are not correct either.
I am working under the assumption that in each ith iteration, newread would read all the data in that file before proceeding on to my_na. Is that a source of the errors? Or is it something else? Please explain. Thanks!
Upvotes: 1
Views: 695
Reputation: 547
Since I don't know what data you are referring to, and no sample was given, here is what I came up with as an edit to your function:
complete <- function (directory, id = 1:332){
  data = data.frame()
  for (i in id){
    newread = read.csv(paste(directory,"/",formatC(i,width=3,flag="0"),".csv",sep=""))
    # keep only the rows that have no NA in any column
    newread = newread[complete.cases(newread),]
    nobs = nrow(newread)
    # rbind() appends one named row per file
    data = rbind(data, data.frame(Name = i, NotNA = nobs))
  }
  return(data)
}
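To see why the counts in the question looked wrong, it may help to compare the two approaches on a tiny data frame. The column names and values below are invented for illustration and are not taken from the real files. sum(!is.na(x)) counts non-missing cells, while complete.cases() flags whole rows:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  a  = c(1.2, NA, 3.4, NA),
  b  = c(0.5, 0.7, NA, NA),
  id = c(1, 1, 1, 1)
)

# Counts every individual cell that is not NA, across all columns
cells_not_na <- sum(!is.na(toy))

# Counts only the rows in which no column is NA
complete_rows <- sum(complete.cases(toy))

print(cells_not_na)
print(complete_rows)
```

Here only the first row is fully observed, so the two numbers differ a lot; the cell count also grows with the number of columns.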
Upvotes: 0
Reputation: 28461
You should think about other approaches to adding values to a vector. In your loop, nobs is overwritten on every iteration, and c() turns data into a list instead of adding a row to a data frame. You asked about id=1, but it gets worse when you feed multiple ids to the function: nobs only ever holds the last value. Here's why:
#Simple function that takes ids and adds 2 to them
myFun <- function(id) {
  nobs = c()
  for(i in id) {
    nobs = 2 + i
  }
  return(nobs)
}
myFun(c(2,3,4))
[1] 6
I told it to return each id plus 2, but it only gave me the last one. I should write it this way:
myFun2 <- function(id) {
  nobs = c()
  for(i in 1:length(id)) {
    nobs[i] <- 2 + id[i]
  }
  return(nobs)
}
myFun2(c(2,3,4))
[1] 4 5 6
Now it's giving the right output. What's different? First, the nobs object is not overwritten anymore; each result is stored in its own position. Note the subset brackets and the new counter in the for loop header.
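The same idea applies to the data frame in the question. As a rough sketch (the numbers are made up, and the names as_list and rows are just for this example), c() on a data frame gives back a plain list, which is why the output printed as [[1]], [[2]] and $row.names, while rbind() keeps a data frame and adds a row:

```r
# c() has no data frame method, so the result is an ordinary list
as_list <- c(data.frame(), 1, 3161, row.names = 1)
print(class(as_list))

# rbind() appends rows and keeps the data frame structure
rows <- data.frame(id = integer(), nobs = integer())
rows <- rbind(rows, data.frame(id = 1L, nobs = 10L))  # made-up counts
rows <- rbind(rows, data.frame(id = 2L, nobs = 20L))
print(rows)
```

Growing a data frame with rbind() inside a loop is fine for a few hundred files, though filling a preallocated vector as in myFun2 tends to be cheaper.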
Also, building objects piece by piece in a loop is not the best way to use R. It is built to do more with less:
complete <- function(directory, id=1:332) {
  nobs <- sapply(id, function(i) {
    sum(complete.cases(read.csv(list.files(path=directory, full.names=TRUE)[i])))
  })
  data.frame(id, nobs)
}
If you would like to fix your code, try something like:
complete <- function (directory, id = 1:332){
  nobs = numeric(length(id)) # one slot per requested id
  # nobs is the number of complete cases in each file
  for (i in 1:length(id)) {
    #get the right filepath
    newread = read.csv(paste(directory,"/",formatC( id[i] ,width=3,flag="0"),".csv",sep=""))
    # complete.cases() is TRUE for rows with no NA; sum(!is.na(newread)) would count cells instead
    nobs[i] = sum(complete.cases(newread))
  }
  data.frame(id, nobs) # return the data frame for the specified id range
}
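If you want to check a version like this without the real data, one option is to write a couple of tiny CSV files to a temporary directory and run the function on them. This is only a sketch: count_complete is a hypothetical stand-in for complete(), the folder name specdata_demo and the file contents are made up, and it assumes the files are named 001.csv, 002.csv and so on, as in the question.

```r
# Hypothetical helper: same idea as complete(), building each path from the id
count_complete <- function(directory, id) {
  nobs <- vapply(id, function(i) {
    path <- file.path(directory, sprintf("%03d.csv", i))
    sum(complete.cases(read.csv(path)))  # rows with no NA in any column
  }, integer(1))
  data.frame(id = id, nobs = nobs)
}

# Build two small CSV files in a temporary directory (made-up values)
tmp <- file.path(tempdir(), "specdata_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(x = c(1, NA, 3), y = c(4, 5, NA)),
          file.path(tmp, "001.csv"), row.names = FALSE)
write.csv(data.frame(x = c(1, 2, 3), y = c(NA, 5, 6)),
          file.path(tmp, "002.csv"), row.names = FALSE)

result <- count_complete(tmp, 1:2)
print(result)
```

The first file has one fully observed row and the second has two, so the nobs column should reflect that if the counting is right.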
Upvotes: 2