Reputation: 245
I'm trying to write a function that reads multiple CSV files, counts the complete (no NA) observations in each one, and, for every file whose count is greater than a threshold passed as an argument, computes the correlation between two columns. The function should return a vector of those correlations.
Right now I have the following code:
> dput(corr)
function (threshold = 0, directory = "/Users/marsh/datasciencecoursera/specdata/")
{
    setwd(directory)
    data_files <- list.files()
    output <- c()
    for (i in data_files) {
        raw_data <- read.csv(data_files[i])
        raw_data_nona <- na.omit(raw_data)
        if (nrow(raw_data_nona) > threshold) {
            sulfate <- raw_data_nona[, "sulfate"]
            nitrate <- raw_data_nona[, "nitrate"]
            correlation <- cor(sulfate, y = nitrate)
            ouput <- c(ouput, correlation)
        }
    }
    ouput
}
When I try to run the code with a threshold of 150, 200, 400, etc, I get an error message that reads:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file 'NA': No such file or directory
I'm not sure what is going wrong. I've checked countless times that the directory is right, and when I run the code piecemeal in the console, line by line, it sometimes works. Any help on why the function can't seem to open the files would be appreciated.
Upvotes: 0
Views: 695
Reputation: 60230
I think your problem is in these two lines:
for (i in data_files) {
    raw_data <- read.csv(data_files[i])
I assume data_files is a vector of filenames like c("data1.csv", "data2.csv"). Then on each iteration of the for loop, i will be a string like "data1.csv". It looks like you expected it to be a number, the index of the current position. Indexing an unnamed character vector by a string finds no matching name and returns NA, which is why the error says it cannot open file 'NA'. You don't need to index back into data_files; you already have the string, so just do:
for (i in data_files) {
    raw_data <- read.csv(i)
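For reference, here is a minimal sketch of how the whole function might look with that change applied. A few things in it are my own assumptions rather than anything from your code: the default directory of ".", the use of list.files(..., full.names = TRUE) in place of setwd(), and the pattern that restricts the listing to .csv files. It also spells the accumulator as output throughout. Your version initialises output but appends to ouput, which would raise an "object 'ouput' not found" error as soon as a file passes the threshold, unless a variable of that name happens to exist in your workspace.

```r
# Iterating over a character vector yields the elements themselves, not positions.
data_files <- c("data1.csv", "data2.csv")
is.na(data_files["data1.csv"])  # TRUE: the vector has no element *named* "data1.csv"

# Sketch of a corrected corr(); the default directory "." is an assumption.
corr <- function(threshold = 0, directory = ".") {
    # full.names = TRUE returns paths that include the directory,
    # so there is no need to change the working directory with setwd()
    data_files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
    output <- numeric(0)
    for (path in data_files) {
        # path is already the file path, so pass it straight to read.csv()
        complete_rows <- na.omit(read.csv(path))
        if (nrow(complete_rows) > threshold) {
            output <- c(output, cor(complete_rows$sulfate, complete_rows$nitrate))
        }
    }
    output
}
```

You would call it as corr(threshold = 150, directory = "/Users/marsh/datasciencecoursera/specdata/"). If no file passes the threshold, it returns an empty numeric vector.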
Upvotes: 2