Nubie coder

Reputation: 31

How to count the number of rows in a data frame filtered with the complete.cases function?

I am working on an assignment to do exactly what the title says, but I can't get the result to match what the assignment requires. The iteration does count the observations that have readings in both columns, "sulfate" and "nitrate", but my code prints a result on every iteration instead of one final table. Can you help me tweak the code?

https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip

It's a zip file containing 332 comma-separated-value (CSV) files of pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor, and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains three variables.
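For context on the function in the title: complete.cases() returns a logical vector that is TRUE for rows with no missing values, so the number of complete rows is just the sum of that vector. A toy illustration (made-up data, not the assignment files):

    df <- data.frame(sulfate = c(1.2, NA, 3.4), nitrate = c(0.5, 0.7, NA))
    complete.cases(df)       # TRUE FALSE FALSE -- rows 2 and 3 each contain an NA
    sum(complete.cases(df))  # 1 complete row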

Some observations are NA and thus should be omitted. I tried the following, but it doesn't work right.

    complete <- function(directory, id = 1:332){
        path1 <- setwd(paste("C:/Users/my_drive/Documents/Coursera/R", directory, sep = "/"))
        list1 <- list.files(path1)
        data1 <- data.frame()

        for(i in id){
            data1 <- rbind(data1, read.csv(list1[i]))
            data1 <- data1[complete.cases(data1),]

            I <- c(data1$ID)
            N <- c(data1$n)
            complete_data <- data.frame(id = I, nobs = N)

            print(complete_data)
        }
    }

I am supposed to get a result like this:

[screenshot: outcome the code should produce]

but...

[screenshot: what I got with my code]

What's wrong with my code? Please keep the explanation at a newbie level so I understand what I'm writing.

Upvotes: 0

Views: 79

Answers (2)

Nubie coder

Reputation: 31

After reviewing others' approaches, I finally wrote the following version, which I can understand myself.

    library(dplyr)

    complete <- function(directory, id = 1:332){
        # list the files in the data directory
        list1 <- list.files(setwd(paste("C:/my_path/Coursera/R_John_Hopkins", directory, sep = "/")))
        data1 <- data.frame()
        for(i in id){
            data1 <- rbind(data1, read.csv(list1[i]))
            # keep only the rows with no missing values
            data2 <- data1[complete.cases(data1),]
        }

        # count the complete rows per monitor ID
        tibble_complete <- data2 %>%
            group_by(ID) %>%
            summarise(nobs = n())

        df_complete <- as.data.frame(tibble_complete)
        colnames(df_complete)[1] <- "id"
        df_complete
    }

# output

> complete("specdata", 30:25)
  id nobs
1 25  463
2 26  586
3 27  338
4 28  475
5 29  711
6 30  932

The only thing I don't understand is why R accepts 30:25 but returns the tibble/data frame sorted 25:30. How can I get R to return the entries in exactly the order specified in the second parameter?
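For anyone with the same question: group_by() plus summarise() returns one row per group, sorted by the grouping variable, so the order of id is discarded. A minimal sketch of one way to restore it, reindexing with match() on the df_complete and id from the function above:

    # reorder the rows to follow the id argument (e.g. 30:25)
    df_complete <- df_complete[match(id, df_complete$id), ]
    rownames(df_complete) <- NULL  # reset the row names after reordering
    df_complete

Substituted for the function's last line, this should make complete("specdata", 30:25) return monitor 30 first.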

Upvotes: 1

Josh Allen

Reputation: 1280

Update: I have updated the code a bit!

setwd('~/Downloads')


read_in_dat = \(your_path, id = 1:332){
  
  # list every CSV in the data directory
  files = list.files(path = your_path,
                     pattern = '*.csv', full.names = TRUE)

  # read each requested file into a list of data frames
  data_one = lapply(id, \(x) read.csv(files[x]))
  
  # bind them together with a single rbind
  data_one = do.call(rbind, data_one)
  
  # drop every row that contains an NA
  data_one_sans_na = na.omit(data_one)
  
  # count the remaining (complete) rows per monitor ID
  complete_data = data.frame(nobs = with(data_one_sans_na, table(ID)))

  colnames(complete_data) = c('ID', "nobs")
  
  print(complete_data)
 
}


check = read_in_dat(your_path = 'specdata', id = c(1,50, 300))
#>    ID nobs
#> 1   1  117
#> 2  50  459
#> 3 300  927

Created on 2024-06-30 with reprex v2.1.0

What changes did I make?

Avoided using absolute paths. If you wanted to reuse this function in another working directory, it wouldn't work: it would always look for a directory somewhere inside the Coursera/R folder, and the paths on my computer are not the same as yours! Instead you should use what are known as relative paths.
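For example (illustrative calls; the absolute path is just the pattern from the question, not a real location):

# Relative path: resolved against the current working directory,
# so the same call works on any machine with a specdata/ folder.
files = list.files(path = 'specdata', pattern = '*.csv', full.names = TRUE)

# Absolute path: hard-coded to one machine's folder layout,
# so it breaks everywhere else.
# files = list.files(path = 'C:/Users/my_drive/Documents/Coursera/R/specdata')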

I changed how you read in and combined the CSVs. for loops in R can be slow because growing an object inside the loop can be quite slow if you don't do it well; see r4ds for more on this subject. The apply and map family of functions are easier for me to think through, and in this case they provide some speed gains.



read_in_dat_for = \(your_path, id = 1:332){
  
  files = list.files(path = your_path,
                     pattern = '*.csv', full.names = TRUE)
  
  data_one = data.frame()
  
  # grow the data frame one file at a time (this is the slow part)
  for (i in id) {
    data_one = rbind(data_one, read.csv(files[i]))
  }
  
  data_one_sans_na = na.omit(data_one)
  
  complete_data = data.frame(nobs = with(data_one_sans_na, table(ID)))
  
  colnames(complete_data) = c('ID', "nobs")
  
  return(complete_data)
  
}


microbenchmark::microbenchmark(
  for_loop_version = read_in_dat_for(your_path = 'specdata'),
  lapply_version = read_in_dat(your_path = 'specdata'),
times = 10)
#> Unit: seconds
#>              expr       min        lq     mean    median        uq      max neval
#>  for_loop_version 41.023760 42.468032 45.46431 43.196820 44.071153 68.00444    10
#>    lapply_version  3.849209  4.749236  6.92037  4.882162  5.694265 17.24910    10

Created on 2024-06-30 with reprex v2.1.0

Somebody who is better at writing for loops could probably write a faster loop; one common trick is sketched below. Hopefully this helps!
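The trick (my variant, not benchmarked above, and assuming the same specdata layout): pre-allocate a list, fill it inside the loop, and call rbind once at the end instead of growing data_one on every iteration.

read_in_dat_prealloc = \(your_path, id = 1:332){
  
  files = list.files(path = your_path,
                     pattern = '*.csv', full.names = TRUE)
  
  # pre-allocate one slot per requested file
  pieces = vector("list", length(id))
  
  for (j in seq_along(id)) {
    pieces[[j]] = read.csv(files[id[j]])
  }
  
  # a single rbind at the end, instead of one per iteration
  data_one = do.call(rbind, pieces)
  
  data_one_sans_na = na.omit(data_one)
  
  complete_data = data.frame(with(data_one_sans_na, table(ID)))
  
  colnames(complete_data) = c('ID', 'nobs')
  
  return(complete_data)
}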

OG answer

If I understood correctly, you are looking to read a bunch of CSVs, drop the missing observations, add the number of rows, and then combine the separate data frames into one data frame.

library(tidyverse)

files = list.files(path = 'specdata', pattern = '*.csv', full.names = TRUE)


complete_tidy = function(data){
  new_dat = data |> 
    drop_na() |> 
    mutate(nobs = n())
  return(new_dat)
}

complete_base = function(data){
  new_dat = data[complete.cases(data), ] 
  new_dat$nobs = nrow(new_dat)
  
  return(new_dat)
}

I generally prefer lapply or purrr::map to iterate over stuff, just because it's easier for me to think through a function; these are just for loops in disguise. So now we are just going to read the CSVs in and then apply our function.

read_in_csvs = map(files, data.table::fread) 
  
spec_data = map(read_in_csvs, \(x) complete_tidy(x)) |> 
  # make into a dataframe
  list_rbind()

spec_data = map(read_in_csvs, \(x) complete_base(x)) |>
  list_rbind()

One of the cool facts I learned recently about read_csv is that you can skip the map step entirely: pass it the whole vector of files and it will handle binding them together behind the scenes. Then you can use dplyr or whatever you want to manipulate the data and achieve the same results.

cleaned_data = read_csv(files) |> 
  drop_na() |> 
  mutate(nobs = n(), .by = ID)

Upvotes: 0
