Zach

Reputation: 1396

Subsetting dataframes stored in a list

I'm having difficulty figuring out how to pull specific columns out of data frames stored in a list. I've read numerous answers on this site, as well as the UCLA and Advanced R material on subsetting, and I'm not making any progress.

My function takes arguments that identify which data I want to pull out across a range of files: for example, dat1, dat2 or dat3 from files 1:15, out of a directory of files numbered 1:999.

Using lapply and read.csv, I have read all of my files (1:15) into a list of data frames.

 x <- lapply(directory[id], function(i) {
   read.csv(i, header = TRUE)
 })

Here is what str(x) shows (first element only):

List of 15
 $ :'data.frame':   1461 obs. of  4 variables:
  ..$ DateObv   : Factor w/ 1461 levels "2003-01-01","2003-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
  ..$ dat1: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ dat2: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ ID     : int [1:1461] 1 1 1 1 1 1 1 1 1 1 ...

So, through an argument to my function, I want to ask for dat1 from files 1:15 and then take the mean of the result.

I thought I could use another lapply to extract dat1 into a vector, but it keeps returning NULL or "list()", or it fails with errors saying the object cannot be subset or that subset is missing an argument. I've tried both subset() and bracket notation.

How would you recommend subsetting the list of data frames so that I get all the dat1 (or dat2) values back as a single vector that I can take the mean of?

Thank you for your time and consideration.

Upvotes: 0

Views: 178

Answers (2)

Edzer Pebesma

Reputation: 4121

Create a similar data set:

> x = list(data.frame(dat1 = 1:3,dat2=10), data.frame(dat1 = 2:4,dat2=10))
> str(x)
List of 2
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 1 2 3
  ..$ dat2: num [1:3] 10 10 10
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 2 3 4
  ..$ dat2: num [1:3] 10 10 10

Use lapply to select the variable dat1 from each data frame:

> lapply(x, function(X) X$dat1)
[[1]]
[1] 1 2 3

[[2]]
[1] 2 3 4

Combine the resulting list into a single vector with c (via do.call), call mean on that vector, and add na.rm=TRUE to drop the NA values:

> mean(do.call(c, lapply(x, function(X) X$dat1)),na.rm=TRUE)
[1] 2.5
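
Since the question mentions passing the column of interest as a function argument, here is a minimal sketch of how the steps above might be wrapped up. Note that `$` does not work with a column name held in a variable, so `[[` is used instead. The function name `pooled_mean` and its arguments `dfs` and `col` are hypothetical, not from the original post, and the toy data includes one NA only to show the effect of `na.rm`.

```r
# Hypothetical helper: pool one column across a list of data frames
# and take the mean of all its values, ignoring NAs.
pooled_mean <- function(dfs, col) {
  # [[ accepts a column name stored in a variable; $ does not
  values <- unlist(lapply(dfs, function(d) d[[col]]))
  mean(values, na.rm = TRUE)
}

# Toy data, similar in shape to the list in the question
x <- list(data.frame(dat1 = 1:3, dat2 = 10),
          data.frame(dat1 = c(2, NA, 4), dat2 = 10))

pooled_mean(x, "dat1")  # 2.4
pooled_mean(x, "dat2")  # 10
```

Here unlist does the same job as do.call(c, ...) for a list of plain vectors.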

Upvotes: 0

shirewoman2

Reputation: 1928

I love plyr for this sort of thing. I would do something like this if you want the mean for each data.frame:

 library(plyr)
 ldply(x, summarize, Mean = mean(dat1, na.rm = TRUE))  # na.rm: the question's data contains NAs

or, if you want a long vector of all the dat1 columns and you want to take the mean of all of them, I'd still use plyr but do this:

 x <- rbind.fill(x)
 mean(x$dat1, na.rm = TRUE)  # na.rm: the question's data contains NAs
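
If plyr is not available, the same two ideas can be sketched in base R. This is only an illustration on toy data; the variable names `per_df`, `stacked` and `overall` are made up for the example.

```r
# Toy data, similar in shape to the list in the question
x <- list(data.frame(dat1 = 1:3, dat2 = 10),
          data.frame(dat1 = 2:4, dat2 = 10))

# One mean per data frame (comparable to the ldply call)
per_df <- sapply(x, function(d) mean(d$dat1, na.rm = TRUE))

# Stack all data frames, then take one overall mean.
# do.call(rbind, ...) needs matching column names in every data frame,
# whereas rbind.fill fills missing columns with NA.
stacked <- do.call(rbind, x)
overall <- mean(stacked$dat1, na.rm = TRUE)
```

The per-file result is a plain numeric vector rather than a data frame, which is the main difference from the ldply output.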

Upvotes: 1
