shoestringfries

Reputation: 279

How to output a data frame in the correct format in R?

I have to write a function that reads a directory full of files and reports the number of completely observed cases in each data file (rows with no NA values in any column). The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases. My draft is below; I hope the comments help.

complete <- function (directory, id = 1:332){
  nobs = numeric() #currently blank
    # nobs is the number of complete cases in each file
  data = data.frame() #currently blank dataframe
  for (i in id){
    #get the right filepath
    newread = read.csv(paste(directory,"/",formatC(i,width=3,flag="0"),".csv",sep=""))
    my_na <- is.na(newread) #let my_na be the logic vector of true and false na values 
    nobs = sum(!my_na) #sum up all the not na values (1 is not na, 0 is na, due to inversion). 
    #this returns # of true values
    #add on to the existing dataframe
    data = c(data, i, nobs, row.names=i)
  }
  data # return the updated data frame for the specified id range
}

The output of a sample run complete("specdata",1) is

[[1]]
[1] 1

[[2]]
[1] 3161

$row.names
[1] 1

I am not sure why it is not displaying in the regular data frame format. I am also fairly sure my numbers are wrong. I am working under the assumption that on each iteration i, newread reads all the data in that file before my_na is computed. Is that a source of the errors, or is it something else? Please explain. Thanks!

Upvotes: 1

Views: 695

Answers (2)

prateek1592

Reputation: 547

Since I don't know what data you are working with and no sample was given, the best I can offer is this edit to your function:

complete <- function (directory, id = 1:332){
  data = data.frame(Name = integer(), NotNA = integer()) # declare both columns up front so rows can be appended
  for (i in id){
    newread = read.csv(paste(directory,"/",formatC(i,width=3,flag="0"),".csv",sep=""))
    newread = newread[complete.cases(newread),]
    nobs = nrow(newread)
    data[nrow(data)+1,] = c(i,nobs)
  }
  names(data) <- c("Name","NotNA")
  return(data)
}
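
A small illustration of why the edit above uses complete.cases() instead of counting non-NA cells. The data frame below is made up for the example; its column names are assumptions, not taken from the actual files:

```r
# Hypothetical toy data: 4 rows, 2 measurement columns with some NAs
toy <- data.frame(
  sulfate = c(1.2, NA, 3.4, 5.6),
  nitrate = c(0.5, 0.7, NA, 0.9)
)

# Counts every non-NA cell across the whole table (what sum(!is.na()) does)
non_na_cells <- sum(!is.na(toy))

# Counts rows in which no column is NA (completely observed cases)
complete_rows <- sum(complete.cases(toy))

non_na_cells   # 6
complete_rows  # 2
```

The two numbers differ because the first counts cells and the second counts rows, which is likely why the totals in the question looked too large.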

Upvotes: 0

Pierre L

Reputation: 28461

You should think about other approaches to adding values to a vector. The function currently overwrites nobs on every pass through the loop. You asked about id=1, but it gets worse when you feed multiple ids to the function: nobs only ever holds the value from the latest iteration. Here's why:

#Simple function that takes ids and adds 2 to them
myFun <- function(id) {

  nobs = c()

  for(i in id) {

    nobs = 2 + i
  }

  return(nobs)
}

myFun(c(2,3,4))
[1] 6

I told it to return each id plus 2, but it only gave me the last one. I should write it this way:

myFun2 <- function(id) {

  nobs = c()

  for(i in 1:length(id)) {

    nobs[i] <- 2 + id[i]
  }

  return(nobs)
}

myFun2(c(2,3,4))
[1] 4 5 6

Now it gives the right output. What's different? First, nobs is no longer overwritten on each pass; each result is stored in its own position. Note the subset brackets and the new index counter in the for loop header.
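
One caveat with the 1:length(id) counter, offered as a side note rather than a required change: if id is ever empty, 1:0 counts down from 1 to 0 and the loop body still runs twice. seq_along(id) avoids that. A minimal sketch, where myFun3 is a hypothetical name for this variant:

```r
myFun3 <- function(id) {
  # Preallocate the result so each slot is filled in place
  nobs <- numeric(length(id))

  # seq_along(id) yields an empty sequence when id is empty
  for (i in seq_along(id)) {
    nobs[i] <- 2 + id[i]
  }

  nobs
}

myFun3(c(2, 3, 4))   # 4 5 6
myFun3(numeric(0))   # numeric(0)
```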

Also, growing objects inside a loop is not the best way to use R. The language is built to do more with less:

complete <- function(directory, id=1:332) {
  nobs <- sapply(id, function(i) {
    sum(complete.cases(read.csv(list.files(path=directory, full.names=TRUE)[i]) )) } )
  data.frame(id, nobs)
}
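
The version above picks files by position in list.files(), which only lines up with id when the directory contains exactly the numbered files in sorted order. A hedged variant that builds each path from the id instead is sketched below. The name complete_by_name is hypothetical, and the zero-padded 001.csv naming is taken from the question's code:

```r
complete_by_name <- function(directory, id = 1:332) {
  # Build "<directory>/001.csv"-style paths directly from the ids
  paths <- file.path(directory, sprintf("%03d.csv", id))

  # Count fully observed rows in each file; vapply enforces one integer per file
  nobs <- vapply(paths, function(p) sum(complete.cases(read.csv(p))),
                 integer(1), USE.NAMES = FALSE)

  data.frame(id, nobs)
}

# Usage with two throwaway files in a temporary directory (made-up data)
demo_dir <- tempfile("specdata")
dir.create(demo_dir)
write.csv(data.frame(x = c(1, NA, 3), y = c(1, 2, 3)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(x = c(NA, NA), y = c(1, 2)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

result <- complete_by_name(demo_dir, 1:2)
result
```

With these two made-up files, the first has two complete rows and the second has none, so result should have one row per id with those counts.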

If you would like to fix your code, try something like:

complete <- function (directory, id = 1:332){
  nobs = numeric(length(id)) # preallocated: one slot per id
    # nobs is the number of complete cases in each file
  for (i in 1:length(id)) {
    #get the right filepath
    newread = read.csv(paste(directory,"/",formatC( id[i] ,width=3,flag="0"),".csv",sep=""))
    ok <- complete.cases(newread) # TRUE for each row with no NA in any column
    nobs[i] = sum(ok) # count the completely observed rows
  }
  data.frame(id, nobs) # return the updated data frame for the specified id range
}
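
As for why the original draft printed a list rather than a data frame: a data frame is a list under the hood, and c() applied to it together with extra values returns a plain list, so the data.frame class is lost. A small sketch with made-up values, plus one possible way to append a row with rbind() instead:

```r
data <- data.frame()

# c() drops the data.frame class and returns a plain list
as_list <- c(data, 1, 3161, row.names = 1)
class(as_list)   # "list"

# rbind() with a one-row data frame keeps the data.frame class
data <- rbind(data, data.frame(id = 1, nobs = 3161))
class(data)      # "data.frame"
data
```

Here row.names = 1 is not treated as a data frame argument at all; c() simply stores it as a named list element, which matches the $row.names entry in the question's output.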

Upvotes: 2
