mto23

Reputation: 467

subset rows by a value while filtering columns in R

I have several datasets ("001.csv","002.csv", and so on, until 332) stored in the same folder, with the following structure (example):

id  p1    p2    
2   35.0  na    
2   5.00  2.05  
2   0.35  1.56  
2   na    0.79 
2   5.23  0.13
2   5.01  0.03

I need to write a function that reads one or more files and returns the number of rows in which both "p1" and "p2" have a value (that is, neither is NA). I wrote this:

cc <- function(directory, id=1:332) {
    files_list <- list.files(directory, full.names = TRUE)
    for (i in id) {
            dat <- read.csv(files_list[i])
    }
    nobs <- length(which(!is.na(dat$p1) & !is.na(dat$p2)))
    completecases <- data.frame(id, nobs)
    completecases
    }

This works perfectly if I choose a single value for "id"; in that case, the outcome would be something like:

> cc(directory, 1)
    id nobs
    1  3

But if I ask for the number of observations in more than one file, it returns, for every "id", the count from the file with the highest "id" (the last one read). For instance,

> cc(directory, 1:2)
    id nobs
    1  4
    2  4

instead of:

> cc(directory, 1:2)
    id nobs
    1  3
    2  4

I believe I need to subset my data by "id" or use "rbind" for each "id", but I have failed so far to get the right formula. Does anyone know how to fix this?

Upvotes: 0

Views: 89

Answers (2)

mto23

Reputation: 467

It was not working because "nobs" has to be computed inside the for loop, once per file, like this:

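cc <- function(directory, id = 1:332) {
    files_list <- list.files(directory, full.names = TRUE)
    nobs <- c()
    for (i in id) {
        dat <- read.csv(files_list[i])
        # Count the rows where both p1 and p2 are present and append
        # this file's count to the running vector
        nobs <- c(nobs, sum(!is.na(dat$p1) & !is.na(dat$p2)))
    }
    # One row per requested id, each with its own count
    completecases <- data.frame(id, nobs)
    completecases
}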

Otherwise, "nobs" is computed only once, after the loop has finished, so it only ever reflects the last file read into "dat".
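
As a possible alternative to growing "nobs" inside the loop, the same per-file count can be sketched with vapply. The helper names count_complete and cc_vapply, and the two small demo files, are made up for illustration and are not part of the original function; the sketch also assumes, like the original, that list.files returns the files in id order.

```r
# Hypothetical helper: rows of one file where both p1 and p2 are non-missing
count_complete <- function(path) {
    dat <- read.csv(path)
    sum(!is.na(dat$p1) & !is.na(dat$p2))
}

# Hypothetical variant of cc(): one count per requested file, no growing vector
cc_vapply <- function(directory, id = 1:332) {
    files_list <- list.files(directory, full.names = TRUE)
    nobs <- vapply(files_list[id], count_complete, integer(1), USE.NAMES = FALSE)
    data.frame(id, nobs)
}

# Demo on two small made-up files written to a temporary folder
demo_dir <- file.path(tempdir(), "cc_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1, p1 = c(1.0, NA, 3.0), p2 = c(2.0, 5.0, NA)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(id = 2, p1 = c(1.0, 2.0, NA), p2 = c(0.5, 0.1, 0.3)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

result <- cc_vapply(demo_dir, 1:2)
print(result)
```

In this demo the first file has one row with both values present and the second has two, so each id gets its own count rather than the count of the last file.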

Upvotes: 0

CPak

Reputation: 13581

Try something like this.

I edited your function to handle a single file and return the number of rows left after filtering out rows with NA:

count_nobs <- function(fi) {
    require(dplyr)
    dat <- read.csv(fi)
    dat[complete.cases(dat), ] %>% count()
}

Call the function with purrr::map_dfr, which iterates through files_list and row-binds the results:

library(tidyverse)
files_list <- list.files(directory, full.names=TRUE)
result <- map_dfr(files_list, ~count_nobs(.x), .id="id")
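
One caveat, sketched below in base R: complete.cases(dat) drops a row if any column is NA, not only "p1" or "p2". If the files had further columns, restricting the check to the two columns of interest might be closer to what the question asks. The data frame and its "note" column are invented for illustration.

```r
# Made-up data: 'note' is missing in a row where p1 and p2 are both present
dat <- data.frame(id = 1,
                  p1 = c(1.0, NA, 3.0),
                  p2 = c(2.0, 5.0, 4.0),
                  note = c("a", "b", NA))

# Every column must be non-NA
n_all_columns <- sum(complete.cases(dat))

# Only p1 and p2 must be non-NA
n_p1_p2 <- sum(complete.cases(dat[, c("p1", "p2")]))

print(c(all_columns = n_all_columns, p1_p2_only = n_p1_p2))
```

With the three-column layout shown in the question the two counts agree, as long as "id" itself is never NA.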

Upvotes: 1
