Danny
Danny

Reputation: 101

Conditional Subset of Rows from Two Separate Data Frames

I have two data frames that are arranged so that qseqid and sseqid are identical in each row:

Data frame 1

  qseqid               evalue bitscore
1 ENSDARP00000000004.2 1e-162 469.0
2 ENSDARP00000000005.6 0e+00 856.0
3 ENSDARP00000000042.9 0e+00 1272.0
4 ENSDARP00000000069.7 3e-111 333.0
5 ENSDARP00000000070.5 2e-11 58.2

Data frame 2

  sseqid               evalue bitscore
1 ENSDARP00000000004.2 9e-160 462.0
2 ENSDARP00000000005.6 0e+00 821.0
3 ENSDARP00000000042.9 0e+00 1293.0
4 ENSDARP00000000069.7 4e-102 310.0
5 ENSDARP00000000070.5 1e-11 58.2

I would like to create a single data frame which contains the row from the data frame (so I can keep the bit score) which has the lowest e-value, and if the evalues are the same in both data frames, skip that row. My results would look like this:

1 ENSDARP00000000004.2 1e-162 469.0
2 ENSDARP00000000069.7 3e-111 333.0
5 ENSDARP00000000070.5 1e-11 58.2

Code for minimal example:

qseqid <- c("ENSDARP00000000004.2",
            "ENSDARP00000000005.6",
            "ENSDARP00000000042.9",
            "ENSDARP00000000069.7",
            "ENSDARP00000000070.5")
evalue <- c(1e-162, 0e+00, 0e+00, 3e-111, 2e-11 )
bitscore <- c(469.0, 856.0, 1272.0, 333.0, 58.2)
df_1 <- data.frame(qseqid, evalue, bitscore)

sseqid <- c("ENSDARP00000000004.2",
            "ENSDARP00000000005.6",
            "ENSDARP00000000042.9",
            "ENSDARP00000000069.7",
            "ENSDARP00000000070.5")  
evalue <- c(9e-160, 0e+00, 0e+00, 4e-102, 1e-11)
bitscore <- c(462.0, 821.0, 1293.0, 310.0, 58.2)
df_2 <- data.frame(sseqid, evalue, bitscore)

My first thought was to use ifelse(), but this approach failed because I wasn't sure how to pass the row if the evalues were equal between data frames, and I wasn't was sure how to return a successful row from either data frame.

filtered_df <- ifelse(df_1$evalue == df_2$evalue,
                  next, ifelse(df_1$evalue < df_2s$evalue, 
                      successful df_1 row here,
                      successful df_2 row here)) 

My second idea was to combine the data frames using rbind(), and then using aggregate() to find the minimum value, and finally, merge the results into a single data frame. However, this didn't end up working the real data set. It's about 12k lines, and I ended up with too many duplicate values. An elegant solution would be greatly appreciated.

Upvotes: 1

Views: 85

Answers (3)

Pablo Adames
Pablo Adames

Reputation: 644

Here is a an option using functional style. Step 1) Use logic discribed in the question to generate a list of valid rows (as lists) or NULL otherwise. Step 2) filter out the null lists. Step 3) recover a data.frame with the answer.

#Step 1:
a <- mapply( function(name1,name2,evalue1,evalue2,bitscore1,bitscore2) {
               if( name1==name2 )
                 if ( evalue1 == evalue2 )
                   NULL
                 else {
                   minEvalue <- min(evalue1,evalue2)
                   keepBitScore <- ifelse(evalue1==minEvalue, bitscore1,bitscore2)
                   list(qseqid=name1,evalue=minEvalue,bitscore=keepBitScore)
                 }
              }, 
              df_1[,1],df_2[,1], df_1[,2],df_2[,2],df_1[,3],df_2[,3])

#Step 2:
Filter(Negate(function(x) is.null(unlist(x))), a)

#Step 3:
ans<-do.call(rbind.data.frame,a)

Gives:

                 qseqid evalue bitscore
2  ENSDARP00000000004.2 1e-162    469.0
21 ENSDARP00000000069.7 3e-111    333.0
3  ENSDARP00000000070.5  1e-11     58.2

My answer loses the original row names. I am sure they can be kept with some tweaks. I just like straight functional style whenever possible.

Upvotes: 1

NT_
NT_

Reputation: 641

It's a good chance to use the power of dplyr package.

First of all, let's make the same titles for dataframes and preserve "qseqid"/"sseqid" as a new variable.

library(dplyr)

df_1 <- df_1 %>% mutate('type' = 'qseqid') %>% rename('instance' = 'qseqid')
df_2 <- df_2 %>% mutate('type' = 'sseqid') %>% rename('instance' = 'sseqid')

Then, we can easily bind dataframes together and process it in a following way:

res <- df_1 %>% bind_rows(df_2) %>% 

    #calculate standart deviation of "evalue" within created groups

    group_by(instance) %>% mutate('diff_e' = sd(evalue)) %>% 

    #select rows with the following logic: std non-equal zero and select minimal within created groups

          filter(diff_e != 0 & evalue == min(evalue))

For more info about dplyr please consider the following book: http://r4ds.had.co.nz

Upvotes: 0

Arkadii Kuznetsov
Arkadii Kuznetsov

Reputation: 479

# Copying the first data.frame
df_3 <- df_1
# Replacing with the values from the second data.frame
# If the values in the second are less than in the first
m <- df_1$evalue > df_2$evalue
df_3[m,] <- df_2[m, ]

# Leave only unique values
df_4 <- df_3[df_1$evalue != df_2$evalue,]
df_4
#                 qseqid evalue bitscore
# 1 ENSDARP00000000004.2 1e-162    469.0
# 4 ENSDARP00000000069.7 3e-111    333.0
# 5 ENSDARP00000000070.5  1e-11     58.2

Upvotes: 5

Related Questions