ajd

Reputation: 76

Take 20+ subsets of data?

I have a dataset and would like to take many subsets of it based on various columns, values, and comparison operators. The output I want is a list in which each element is one of the subsetted data frames. I tried building a data frame of the subset conditions, writing a function, and using apply to feed each row of that data frame to the function, but that didn't work. There is probably a better approach, perhaps with an anonymous function, but I'm not sure how to implement it. Below is example code that should produce 8 subsets of the data.

Original dataset, where x1 and x2 are item scores that won't be used for subsetting, and RT and LS are the variables to subset on:

df <- data.frame(x1 = rnorm(100),
                 x2 = rnorm(100),
                 RT = abs(rnorm(100)),
                 LS = sample(1:10, 100, replace = TRUE))

Data frame containing the conditions for subsetting. For example, the first subset should be all observations with values greater than or equal to 0.5 in the RT column, the second should be all observations with values greater than or equal to 1 in the RT column, and so on. There should be 8 subsets: 4 on the RT variable (using >=) and 4 on the LS variable (using <=).

subsetConditions <- data.frame(column = rep(c("RT", "LS"), each = 4),
                               operator = rep(c(">=", "<="), each = 4),
                               value = c(0.5, 1, 1.5, 2,
                                         9, 8, 7, 6))

And this is the ugly function I wrote to attempt to do this:

subsetFun <- function(x){
  subset(df, eval(parse(text = paste(x))))
}  

subsets <- apply(subsetConditions, 1, subsetFun)

Thanks for any help!

Upvotes: 4

Views: 86

Answers (2)

Parfait

Reputation: 107687

Consider Map (a wrapper around mapply) and avoid eval + parse altogether. Operators such as ==, <=, and >= are ordinary two-argument functions, so 4 <= 5 can also be written as `<=`(4, 5) or "<="(4, 5). You can therefore pass the columns of subsetConditions elementwise and use get to look up each operator function from its string name:

sub_data <- function(col, op, val) {
  df[get(op)(df[[col]], val),]
}

sub_dfs <- with(subsetConditions, Map(sub_data, column, operator, value))

Output

str(sub_dfs)
List of 8
 $ RT:'data.frame': 62 obs. of  4 variables:
  ..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 1.63 ...
  ..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.662 ...
  ..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.543 ...
  ..$ LS: int [1:62] 5 5 1 2 9 1 5 9 3 10 ...
 $ RT:'data.frame': 36 obs. of  4 variables:
  ..$ x1: num [1:36] -0.745 0.848 0.908 -0.761 0.74 ...
  ..$ x2: num [1:36] -2.3849 -0.3131 -2.4645 -0.0784 0.8512 ...
  ..$ RT: num [1:36] 1.66 2.15 1.74 1.65 1.13 ...
  ..$ LS: int [1:36] 5 2 1 5 9 10 2 7 1 3 ...
 $ RT:'data.frame': 14 obs. of  4 variables:
  ..$ x1: num [1:14] -0.745 0.848 0.908 -0.761 -1.063 ...
  ..$ x2: num [1:14] -2.3849 -0.3131 -2.4645 -0.0784 -2.9886 ...
  ..$ RT: num [1:14] 1.66 2.15 1.74 1.65 2.63 ...
  ..$ LS: int [1:14] 5 2 1 5 5 6 9 4 8 4 ...
 $ RT:'data.frame': 3 obs. of  4 variables:
  ..$ x1: num [1:3] 0.848 -1.063 0.197
  ..$ x2: num [1:3] -0.313 -2.989 0.709
  ..$ RT: num [1:3] 2.15 2.63 2.05
  ..$ LS: int [1:3] 2 5 6
 $ LS:'data.frame': 92 obs. of  4 variables:
  ..$ x1: num [1:92] -1.12 -0.745 -1.377 0.848 0.612 ...
  ..$ x2: num [1:92] -0.257 -2.385 0.805 -0.313 0.958 ...
  ..$ RT: num [1:92] 0.693 1.662 0.731 2.145 0.489 ...
  ..$ LS: int [1:92] 5 5 1 2 1 9 1 5 9 3 ...
 $ LS:'data.frame': 78 obs. of  4 variables:
  ..$ x1: num [1:78] -1.12 -0.745 -1.377 0.848 0.612 ...
  ..$ x2: num [1:78] -0.257 -2.385 0.805 -0.313 0.958 ...
  ..$ RT: num [1:78] 0.693 1.662 0.731 2.145 0.489 ...
  ..$ LS: int [1:78] 5 5 1 2 1 1 5 3 5 2 ...
 $ LS:'data.frame': 75 obs. of  4 variables:
  ..$ x1: num [1:75] -1.12 -0.745 -1.377 0.848 0.612 ...
  ..$ x2: num [1:75] -0.257 -2.385 0.805 -0.313 0.958 ...
  ..$ RT: num [1:75] 0.693 1.662 0.731 2.145 0.489 ...
  ..$ LS: int [1:75] 5 5 1 2 1 1 5 3 5 2 ...
 $ LS:'data.frame': 62 obs. of  4 variables:
  ..$ x1: num [1:62] -1.12 -0.745 -1.377 0.848 0.612 ...
  ..$ x2: num [1:62] -0.257 -2.385 0.805 -0.313 0.958 ...
  ..$ RT: num [1:62] 0.693 1.662 0.731 2.145 0.489 ...
  ..$ LS: int [1:62] 5 5 1 2 1 1 5 3 5 2 ...
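
As a variation on the approach above, here is a minimal sketch that passes the data frame as an argument instead of reading df from the global environment, and labels each list element with the condition that produced it. The names sub_data2 and sub_dfs2, the fixed seed, and stringsAsFactors = FALSE are assumptions made for illustration and are not part of the original answer. match.fun is a base R alternative to get that only looks up functions.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible

df <- data.frame(x1 = rnorm(100),
                 x2 = rnorm(100),
                 RT = abs(rnorm(100)),
                 LS = sample(1:10, 100, replace = TRUE))

# stringsAsFactors = FALSE keeps column/operator as character in older R versions
subsetConditions <- data.frame(column = rep(c("RT", "LS"), each = 4),
                               operator = rep(c(">=", "<="), each = 4),
                               value = c(0.5, 1, 1.5, 2,
                                         9, 8, 7, 6),
                               stringsAsFactors = FALSE)

# Hypothetical helper: same idea as sub_data, but the data frame is an argument
sub_data2 <- function(data, col, op, val) {
  keep <- match.fun(op)(data[[col]], val)  # e.g. `>=`(data$RT, 0.5)
  data[keep, , drop = FALSE]
}

sub_dfs2 <- with(subsetConditions,
                 Map(function(col, op, val) sub_data2(df, col, op, val),
                     column, operator, value))

# Label each element with its condition, e.g. "RT >= 0.5"
names(sub_dfs2) <- with(subsetConditions, paste(column, operator, value))
```

With these names, a single subset can be pulled out by its condition, for example sub_dfs2[["LS <= 9"]], rather than by position.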

Upvotes: 3

AndrewGB

Reputation: 16856

Your function was close and needs only one adjustment. apply passes each row to the function as a character vector of 3 elements, so paste(x) returns 3 separate strings. Add collapse = "" so that the three pieces are joined into a single string, which can then be parsed and evaluated as one expression.

subsetFun <- function(x){
  subset(df, eval(parse(text = paste(x, collapse = ""))))
}  

subsets <- apply(subsetConditions, 1, subsetFun)

Output

The call now returns the 8 subsets.

str(subsets)

List of 8
 $ :'data.frame':   67 obs. of  4 variables:
  ..$ x1: num [1:67] -1.208 0.606 -0.17 0.728 -0.424 ...
  ..$ x2: num [1:67] 0.4058 -0.3041 -0.3357 0.7904 -0.0264 ...
  ..$ RT: num [1:67] 1.972 0.883 0.598 0.633 1.517 ...
  ..$ LS: int [1:67] 8 9 2 10 8 5 3 4 7 2 ...
 $ :'data.frame':   35 obs. of  4 variables:
  ..$ x1: num [1:35] -1.2083 -0.4241 -0.0906 0.9851 -0.8236 ...
  ..$ x2: num [1:35] 0.4058 -0.0264 1.0054 0.0653 1.4647 ...
  ..$ RT: num [1:35] 1.97 1.52 1.05 1.63 1.47 ...
  ..$ LS: int [1:35] 8 8 5 4 7 3 1 6 8 6 ...
 $ :'data.frame':   16 obs. of  4 variables:
  ..$ x1: num [1:16] -1.208 -0.424 0.985 0.99 0.939 ...
  ..$ x2: num [1:16] 0.4058 -0.0264 0.0653 0.3486 -0.7562 ...
  ..$ RT: num [1:16] 1.97 1.52 1.63 1.85 1.8 ...
  ..$ LS: int [1:16] 8 8 4 6 10 2 6 6 3 9 ...
 $ :'data.frame':   7 obs. of  4 variables:
  ..$ x1: num [1:7] 0.963 0.423 -0.444 0.279 0.417 ...
  ..$ x2: num [1:7] 0.6612 0.0354 0.0555 0.1253 -0.3056 ...
  ..$ RT: num [1:7] 2.71 2.15 2.05 2.01 2.07 ...
  ..$ LS: int [1:7] 2 6 9 9 7 7 4
 $ :'data.frame':   91 obs. of  4 variables:
  ..$ x1: num [1:91] -0.952 -1.208 0.606 -0.17 -0.048 ...
  ..$ x2: num [1:91] -0.645 0.406 -0.304 -0.336 -0.897 ...
  ..$ RT: num [1:91] 0.471 1.972 0.883 0.598 0.224 ...
  ..$ LS: int [1:91] 6 8 9 2 1 8 4 5 3 4 ...
 $ :'data.frame':   75 obs. of  4 variables:
  ..$ x1: num [1:75] -0.952 -1.208 -0.17 -0.048 -0.424 ...
  ..$ x2: num [1:75] -0.6448 0.4058 -0.3357 -0.8968 -0.0264 ...
  ..$ RT: num [1:75] 0.471 1.972 0.598 0.224 1.517 ...
  ..$ LS: int [1:75] 6 8 2 1 8 4 5 3 4 1 ...
 $ :'data.frame':   65 obs. of  4 variables:
  ..$ x1: num [1:65] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
  ..$ x2: num [1:65] -0.645 -0.336 -0.897 -2.072 1.005 ...
  ..$ RT: num [1:65] 0.471 0.598 0.224 0.486 1.053 ...
  ..$ LS: int [1:65] 6 2 1 4 5 3 4 1 7 4 ...
 $ :'data.frame':   58 obs. of  4 variables:
  ..$ x1: num [1:58] -0.9517 -0.1698 -0.048 0.2834 -0.0906 ...
  ..$ x2: num [1:58] -0.645 -0.336 -0.897 -2.072 1.005 ...
  ..$ RT: num [1:58] 0.471 0.598 0.224 0.486 1.053 ...
  ..$ LS: int [1:58] 6 2 1 4 5 3 4 1 4 2 ...
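
To see why collapse matters, here is a small sketch that works on one condition in isolation. The vector cond_row is a hand-written stand-in for what apply passes for the first row (apply converts the data frame to a character matrix, so the value arrives as a string), and the three-row data frame d is a made-up example, not the data from the question.

```r
# Assumed stand-in for the first row as apply() would pass it
cond_row <- c(column = "RT", operator = ">=", value = "0.5")

pieces <- paste(cond_row)                  # still three separate strings
joined <- paste(cond_row, collapse = "")   # one string: "RT>=0.5"

# Parse the single string into one expression and evaluate it inside d
cond <- parse(text = joined)[[1]]
d <- data.frame(RT = c(0.2, 0.7, 1.3))     # hypothetical toy data
res <- d[eval(cond, d), , drop = FALSE]    # keeps rows where RT >= 0.5
```

Without collapse, parse receives "RT", ">=", and "0.5" as three separate lines of code, and ">=" on its own line is not a valid expression.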

Upvotes: 1
