Reputation: 31
I am working on a course assignment, but I can't get my result to match exactly what the assignment requires. The iteration does count the observations that have readings in both the "sulfate" and "nitrate" columns, but my code returns a separate result with each iteration. Can you help me tweak the code?
https://d396qusza40orc.cloudfront.net/rprog%2Fdata%2Fspecdata.zip
It's a zip file containing 332 comma-separated-value (CSV) files of pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor, and the ID number for each monitor is contained in the file name; for example, data for monitor 200 is in the file "200.csv". Each file contains three variables.
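For reference, reading a single monitor's file looks roughly like this (a minimal sketch; I'm assuming the unzipped specdata folder is in the working directory, and the exact file names, e.g. any zero-padding, may differ):
# sketch: read one monitor's file and count its complete rows
# (assumes the unzipped "specdata" folder is in the working directory)
one_monitor <- read.csv(file.path("specdata", "200.csv"))
sum(complete.cases(one_monitor))   # rows with both sulfate and nitrate readings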
Some observations are NA and need to be omitted. I tried the following, but it doesn't work right.
complete <- function(directory, id = 1:332){
  path1 <- setwd(paste("C:/Users/my_drive/Documents/Coursera/R", directory, sep = "/"))
  list1 <- list.files(path1)
  data1 <- data.frame()
  for(i in id){
    data1 <- rbind(data1, read.csv(list1[i]))
    data1 <- data1[complete.cases(data1),]
    I <- c(data1$ID)
    N <- c(data1$n)
    complete_data <- data.frame(id = I, nobs = N)
    print(complete_data)
  }
}
I am supposed to get a result like this:
but...
What's wrong with my code? Please keep the explanation at a newbie level so I can understand what I'm writing.
Upvotes: 0
Views: 79
Reputation: 31
library(dplyr)

complete <- function(directory, id = 1:332){
  list1 <- list.files(setwd(paste("C:/my_path/Coursera/R_John_Hopkins", directory, sep = "/")))
  data1 <- data.frame()
  # combine the requested files, keeping only complete rows
  for(i in id){
    data1 <- rbind(data1, read.csv(list1[i]))
    data2 <- data1[complete.cases(data1),]
  }
  # count complete observations per monitor
  tibble_complete <- data2 %>%
    group_by(ID) %>%
    summarise(nobs = n())
  df_complete <- as.data.frame(tibble_complete)
  colnames(df_complete)[1] <- "id"
  df_complete
}
# output
> complete("specdata", 30:25)
id nobs
1 25 463
2 26 586
3 27 338
4 28 475
5 29 711
6 30 932
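If the expected result should keep the same order as the id argument (here 30 down to 25) instead of ascending IDs, one small tweak (just an idea, not something the grader necessarily requires) is to reorder the rows at the end of the function with match(), assuming every requested id appears in the result:
df_complete <- df_complete[match(id, df_complete$id), ]  # reorder rows to follow the id argument
rownames(df_complete) <- NULL                            # tidy up the row names
df_complete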
Upvotes: 1
Reputation: 1280
Update: I have updated the code a bit!
setwd('~/Downloads')

read_in_dat = \(your_path, id = 1:332){
  # list all the csv files in the data directory
  files = list.files(path = paste(your_path),
                     pattern = '*.csv', full.names = TRUE)
  # read each requested monitor's file, then stack them into one data frame
  data_one = lapply(id, \(x) read.csv(files[x]))
  data_one = do.call(rbind, data_one)
  # drop rows with any missing values, then count remaining rows per monitor
  data_one_sans_na = na.omit(data_one)
  complete_data = data.frame(nobs = with(data_one_sans_na, table(ID)))
  colnames(complete_data) = c('ID', "nobs")
  print(complete_data)
}
check = read_in_dat(your_path = 'specdata', id = c(1,50, 300))
#> ID nobs
#> 1 1 117
#> 2 50 459
#> 3 300 927
Created on 2024-06-30 with reprex v2.1.0
What changes did I make?
Avoided using absolute paths. If you hard-code an absolute path, the function stops working as soon as you reuse it from another working directory or on another machine, because it will always look for that exact Coursera/R folder. Instead, use what are known as relative paths; the paths on my computer are not the same as yours!
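For example (a minimal sketch, assuming the unzipped specdata folder sits inside the current working directory):
# relative path: resolved against whatever the current working directory happens to be
files = list.files(path = 'specdata', pattern = '*.csv', full.names = TRUE)

# absolute path: only works on a machine that has this exact folder
# files = list.files(path = 'C:/Users/my_drive/Documents/Coursera/R/specdata', full.names = TRUE)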
I changed how you read in and combined the CSVs. for loops in R can be slow because growing an object inside the loop can be quite slow if you don't do it well; see r4ds for more on this subject. The apply and map families of functions are easier for me to think through, and in this case they provide some speed gains.
read_in_dat_for = \(your_path, id = 1:332){
  files = list.files(path = paste(your_path),
                     pattern = '*.csv', full.names = TRUE)
  data_one = data.frame()
  for (i in id) {
    data_one = rbind(data_one, read.csv(files[i]))
  }
  data_one_sans_na = na.omit(data_one)
  complete_data = data.frame(nobs = with(data_one_sans_na, table(ID)))
  colnames(complete_data) = c('ID', "nobs")
  return(complete_data)
}
microbenchmark::microbenchmark(
for_loop_version = read_in_dat_for(your_path = 'specdata'),
lapply_version = read_in_dat(your_path = 'specdata'),
times = 10)
#> Unit: seconds
#>              expr       min        lq     mean    median        uq      max neval
#>  for_loop_version 41.023760 42.468032 45.46431 43.196820 44.071153 68.00444    10
#>    lapply_version  3.849209  4.749236  6.92037  4.882162  5.694265 17.24910    10
Created on 2024-06-30 with reprex v2.1.0
Somebody who is better at writing for loops could probably write a faster loop; one sketch is below. Hopefully this helps!
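As a hedged sketch of what a faster loop might look like, you can pre-allocate a list, fill it inside the loop, and only bind once at the end (the repeated rbind(data_one, ...) is what makes the loop above slow):
read_in_dat_prealloc = \(your_path, id = 1:332){
  files = list.files(path = your_path, pattern = '*.csv', full.names = TRUE)
  pieces = vector('list', length(id))         # pre-allocate one slot per requested file
  for (i in seq_along(id)) {
    pieces[[i]] = read.csv(files[id[i]])      # fill the slot instead of growing a data frame
  }
  data_one = do.call(rbind, pieces)           # bind everything together once
  data_one_sans_na = na.omit(data_one)
  complete_data = as.data.frame(table(data_one_sans_na$ID))
  colnames(complete_data) = c('ID', 'nobs')
  complete_data
}
I haven't benchmarked this version, but it avoids repeatedly copying the growing data frame.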
If I understood correctly, you are looking to read a bunch of CSVs, drop the missing observations, add the number of complete rows, and then combine the separate data frames into one data frame.
library(tidyverse)
files = list.files(path = 'specdata', pattern = '*.csv', full.names = TRUE)
complete_tidy = function(data){
new_dat = data |>
drop_na() |>
mutate(nobs = n())
return(new_dat)
}
complete_base = function(data){
new_dat = data[complete.cases(data), ]
new_dat$nobs = nrow(new_dat)
return(new_dat)
}
I generally prefer lapply or purrr::map to iterate over stuff, just because it's easier for me to think through a function. These are just for loops in disguise. So now we are just going to read the CSVs in and then apply our function.
# read every csv into a list of data frames
read_in_csvs = map(files, data.table::fread)

# tidyverse helper applied to each file
spec_data = map(read_in_csvs, \(x) complete_tidy(x)) |>
  # make into a dataframe
  list_rbind()

# same thing with the base R helper
spec_data = map(read_in_csvs, \(x) complete_base(x)) |>
  list_rbind()
One of the cool facts I learned recently about read_csv is that you can skip the map part and just pass in the vector of files; it will handle binding them together behind the scenes. Then you can use dplyr or whatever you want to manipulate the data and achieve the same results.
cleaned_data = read_csv(files) |>
drop_na() |>
mutate(nobs = n(), .by = ID)
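If you only want one summary row per monitor (a two-column id/nobs table) rather than the full data with an nobs column attached, you can collapse the result afterwards, for example:
# keep one row per monitor with its nobs count
cleaned_data |>
  distinct(ID, nobs)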
Upvotes: 0