Reputation: 149
I'm new to programming and trying to count the number of rows in a file after removing the NA values. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.
Here is my code:
complete <- function(directory, id = 1:332){
  setwd(directory)
  df <- data.frame()
  for (i in seq_along(id)){
    if (id[i] < 10){
      file_name <- paste("00",id[i],".csv", sep = "")
    }
    else if (id[i] >= 10 & id[i] < 100){
      file_name <- paste("0",id[i],".csv", sep = "")
    }
    else{
      file_name <- paste(id[i],".csv", sep = "")
    }
    file <- read.csv(as.character(file_name))
    newfile <- na.omit(file)
    #print(newfile)
    df <- data.frame(id = id, nobs = nrow(newfile))
  }
  print(df)
}
When I pass in a vector of 1:3 like so: complete("specdata", 1:3) I'm getting the following output:
id nobs
1 243
2 243
3 243
Here id is the file number (the files are numbered 1 to 332) and nobs is the number of complete cases.
It seems to be taking the last count and repeating it for every id, and I don't know how to fix it. As a beginner, I get tripped up on programming logic like this. I also saw a few other solutions to this problem, but they used complete.cases, which I didn't understand how to apply. Each id in the data frame should have its own count of complete cases (the nobs column in the data frame).
Upvotes: 1
Views: 1071
Reputation: 28441
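Here is a shortened version to study from. Notice that I do not have to explicitly paste the zeroes: list.files returns the names in alphabetical order, so the zero-padded files already come back in id order and I can pick them by position. This assumes the directory holds only those CSV files, with none missing. Try list.files(path="specdata", full.names=TRUE) alone to see what that function does: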
complete <- function(directory, id=1:332) {
  lst <- sapply(id, function(x) {
    df <- read.csv(list.files(path=directory, full.names=TRUE, pattern="csv")[x])
    sum(complete.cases(df))
  })
  data.frame(id,nobs=unlist(lst))
}
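If you would rather build the file names yourself than rely on the listing order, sprintf can do the zero-padding in one call. This is only a minimal sketch, assuming the files are named 001.csv through 332.csv as in the question; the ids used here are just illustrative:

```r
# Illustrative ids; any whole numbers from 1 to 332 would work the same way.
id <- c(1, 25, 300)

# "%03d" pads each number with leading zeros to a width of three digits.
file_names <- sprintf("%03d.csv", id)
file_names
# [1] "001.csv" "025.csv" "300.csv"
```

Combined with file.path(directory, file_names), this gives the full paths and replaces the three-branch if/else from the question.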
edit
The difference between na.omit and complete.cases:
#Example
#Create data.frame with an NA value
df <- head(iris,3)
df[1,1] <- NA
df
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1           NA         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
#'na.omit' will return a data.frame with non-NA rows:
na.omit(df)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
#'complete.cases' gives one logical per row: TRUE if the row has no NAs, FALSE if it has any
complete.cases(df)
#[1] FALSE  TRUE  TRUE
I use complete.cases because I just want the total count of non-NA rows. I don't need the data.frame itself, which is what na.omit gives.
I can add up the TRUEs and FALSEs to get the total with sum(complete.cases(df)). When summing, R treats each TRUE as 1 and each FALSE as 0.
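To see that both routes give the same count, here is a small check on the same three-row example (a sketch using the built-in iris data, as above; the variable names n_complete and n_omit are just for illustration):

```r
# Same toy data frame as in the example: first row has one NA.
df <- head(iris, 3)
df[1, 1] <- NA

# Count complete rows by summing the logical vector (TRUE counts as 1).
n_complete <- sum(complete.cases(df))

# Count complete rows by dropping incomplete rows and counting what is left.
n_omit <- nrow(na.omit(df))

n_complete  # 2
n_omit      # 2
```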
Upvotes: 1
Reputation: 698
You have to make a couple of changes in your code. First, define your data.frame in full at the beginning, before your loop:
df <- data.frame(id = id, nobs = NA)
Second, after you create newfile, replace your df <- data.frame... instruction with:
df[i,2] <- nrow(newfile)
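The reason the original repeats 243 is that each pass through the loop rebuilds the whole data frame with data.frame(id = id, nobs = nrow(newfile)), and the single count gets recycled across every id, so only the last file's count survives. Put together, the two changes might look like the sketch below. Note that it goes a little beyond the two changes: it uses sprintf for the file name, file.path instead of setwd, and returns df instead of printing it. The toy files and the demo_dir name are hypothetical, only there so the example runs on its own:

```r
complete <- function(directory, id = 1:332) {
  # Preallocate one row per requested id; nobs is filled in by the loop.
  df <- data.frame(id = id, nobs = NA)
  for (i in seq_along(id)) {
    file_name <- sprintf("%03d.csv", id[i])
    file <- read.csv(file.path(directory, file_name))
    newfile <- na.omit(file)
    # Write only row i instead of rebuilding the whole data frame.
    df[i, 2] <- nrow(newfile)
  }
  df
}

# Hypothetical toy data: three small files in a temporary directory.
demo_dir <- tempfile("specdata")
dir.create(demo_dir)
write.csv(data.frame(a = c(1, NA, 3), b = c(1, 2, 3)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(a = c(NA, NA), b = c(1, 2)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)
write.csv(data.frame(a = c(1, 2), b = c(3, 4)),
          file.path(demo_dir, "003.csv"), row.names = FALSE)

# Each id now gets its own count: nobs is 2, 0, 2 for this toy data.
result <- complete(demo_dir, 1:3)
result
```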
Upvotes: 0