Corey

Reputation: 25

r - subset multiple data frames (or one list) based on cut-offs from a reference data frame

I have a series of data frames representing separate molecules (F0001, F0002, ...) that contain hundreds or thousands of scores from experiments using that molecule. Each data frame looks like this:

F0001

    PoseID  Score
1   AAAA_1  -13.70
2   AAAA_2  -9.21
3   AAAA_3  -7.60
4   AAAA_4  -6.28
5   ....

F0002

    PoseID  Score
1   AAAB_1  -14.90
2   AAAB_2  -13.92
3   AAAB_3  -13.49
4   AAAB_4  -11.95
5   ....

etc., etc.

Based on a cut-off, I'd like to subset the data to throw out any poses that fall above that cut-off, so it's a simple binary comparison. A slight complicating factor is that the cut-off differs for each of F0001, F0002, ..., so I've stored them in a data frame (let's call it cutoffs).

cutoffs

     FragmentID     ScoreCutOff
1    F0001          -9.69
2    F0002          -9.33
3    F0003          -8.50
4    ....

So I guess the question becomes, do I perform the comparison between cutoffs and each data frame or add all the data frames to a list and perform the comparison between cutoffs and the list of data frames there?

I feel that Ari Friedman's answer is in the ballpark, so I'm tooling about with sapply/any to get it working. One usually solves this sort of problem quite easily with nested loops and data structures in Python/C++/Java, but I'm new to doing it in R, so I'm keen to hear any other ideas people have. Of course, if I solve it myself in the interim, I will post the solution for critique.

Upvotes: 1

Views: 588

Answers (2)

Veerendra Gadekar

Reputation: 4472

Assuming df1 and df2 are your data frames, you could try this using lapply:

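dflist = list(df1, df2)
# assumes cutoffs has one row per data frame, in the same order as dflist
names(dflist) = cutoffs$FragmentID

out = lapply(names(dflist), 
      function(x){ 
        # look up the cut-off row for this fragment
        cfval = subset(cutoffs, FragmentID %in% x)
        # keep only the poses scoring below that cut-off
        subset(dflist[[x]], Score < cfval$ScoreCutOff)
      })

names(out) = names(dflist)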

#> out
#$F0001
#  PoseID Score
#1 AAAA_1 -13.7
# 
#$F0002
#  PoseID  Score
#1 AAAB_1 -14.90
#2 AAAB_2 -13.92
#3 AAAB_3 -13.49
#4 AAAB_4 -11.95

Later, if you want to have all the data frames separately, you could do this:

# data-frames with names F0001, F0002, ....
list2env(out,.GlobalEnv)
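
If you would rather not loop over the list at all, another option is to stack everything into one data frame and let merge() attach the right cut-off to each row. This is only a sketch of that alternative, not the approach above: the sample values are the rows shown in the question, and the names stacked, merged, kept and kept_by_fragment are made up for illustration.

```r
# Sample data copied from the rows shown in the question.
F0001 <- data.frame(PoseID = paste0("AAAA_", 1:4),
                    Score  = c(-13.70, -9.21, -7.60, -6.28),
                    stringsAsFactors = FALSE)
F0002 <- data.frame(PoseID = paste0("AAAB_", 1:4),
                    Score  = c(-14.90, -13.92, -13.49, -11.95),
                    stringsAsFactors = FALSE)
cutoffs <- data.frame(FragmentID  = c("F0001", "F0002", "F0003"),
                      ScoreCutOff = c(-9.69, -9.33, -8.50),
                      stringsAsFactors = FALSE)

# Named list, so each frame carries its own fragment ID.
dflist <- list(F0001 = F0001, F0002 = F0002)

# Stack the frames, recording which fragment each row came from.
stacked <- do.call(rbind, Map(function(d, id) {
  data.frame(FragmentID = id, d, stringsAsFactors = FALSE)
}, dflist, names(dflist)))

# Attach each row's cut-off by FragmentID, then filter in one step.
merged <- merge(stacked, cutoffs, by = "FragmentID")
kept   <- merged[merged$Score < merged$ScoreCutOff,
                 c("FragmentID", "PoseID", "Score")]

# Split back into one data frame per fragment if that is what you need.
kept_by_fragment <- split(kept[c("PoseID", "Score")], kept$FragmentID)
```

Because the cut-off is joined on FragmentID, the order of rows in cutoffs does not matter here, and fragments without a data frame (F0003 in this sample) simply drop out of the merge.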

Upvotes: 0

vaettchen

Reputation: 7659

Based on the information you provide, something like this should do the job:

# bring your data.frames into a list:
f <- list( F0001, F0002 )
> f
[[1]]
  PoseID  Score
1 AAAA_1 -13.70
2 AAAA_2  -9.21
3 AAAA_3  -7.60
4 AAAA_4  -6.28

[[2]]
  PoseID  Score
1 AAAB_1 -14.90
2 AAAB_2 -13.92
3 AAAB_3 -13.49
4 AAAB_4 -11.95

# subset per list item
for( i in 1 : length( f ) ) 
    f[[ i ]] <- f[[ i ]][ f[[ i ]][ 2 ] < cutoffs[ i, 2 ], ]
> f
[[1]]
  PoseID Score
1 AAAA_1 -13.7

[[2]]
  PoseID  Score
1 AAAB_1 -14.90
2 AAAB_2 -13.92
3 AAAB_3 -13.49
4 AAAB_4 -11.95

I'm not sure what you mean by "above cut-off"; you may have to reverse the less-than (<) comparison. I also assume that the rows in cutoffs are in exactly the same order as the data frames in the list. Otherwise, an additional step to identify the corresponding cut-off may be necessary.
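
If the order cannot be relied on, one way to find the corresponding cut-off is to name the list elements and look each one up with match(). This is a rough sketch and assumes the list is named after the fragment IDs (the list above is unnamed); the sample values are the rows from the question, and idx is just an illustrative helper name.

```r
# Sample data copied from the rows shown in the question.
F0001 <- data.frame(PoseID = paste0("AAAA_", 1:4),
                    Score  = c(-13.70, -9.21, -7.60, -6.28))
F0002 <- data.frame(PoseID = paste0("AAAB_", 1:4),
                    Score  = c(-14.90, -13.92, -13.49, -11.95))
cutoffs <- data.frame(FragmentID  = c("F0001", "F0002", "F0003"),
                      ScoreCutOff = c(-9.69, -9.33, -8.50))

# Assumption: the list is named by fragment ID; deliberately out of order here.
f <- list(F0002 = F0002, F0001 = F0001)

# Row of cutoffs that belongs to each list element, matched by name.
idx <- match(names(f), as.character(cutoffs$FragmentID))
stopifnot(!anyNA(idx))  # stop early if a fragment has no cut-off

# Same subsetting as before, but using the matched row instead of row i.
for (i in seq_along(f)) {
  keep <- f[[i]]$Score < cutoffs$ScoreCutOff[idx[i]]
  f[[i]] <- f[[i]][keep, , drop = FALSE]
}
```

With this lookup, cutoffs may contain extra fragments or be sorted differently without affecting the result.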

Upvotes: 1
