Reputation: 91

Check if column value from one dataframe is in between (range) of two other columns of second dataframe

I have two dataframes of different sizes:

df1 <- data.frame(Chr = c(1, 1, 2, 3, 4),
                  Start = c(15, 120, 210, 210, 450),
                  End = c(15, 130, 210, 210, 450),
                  Gene = c("gene1", "gene2", "gene3", "gene3", "gene3"),
                  sample_id = c("ss6", "ss7", "ss9", "ss9", "ss10"))

df2 <- data.frame(Chr = c(1, 1, 3),
                  Start = c(10, 100, 200),
                  End = c(50, 200, 250),
                  Gene = c("gene1", "gene2", "gene3"),
                  sample_id = c("ss1", "ss1", "ss1"))

I would like to take Start from df1 and check whether it falls within the Start-End range of a row in df2, while also making sure that Chr is the same (sample_id does not have to match). If it does, I want to add a column to df1, ideally containing df2$sample_id, or, if that is not possible, YES (with NA for no match). It is similar to the question "Only checking range", but I also need to match Chr.

It is also similar to the question "Check if column value is in between (range) of two other column values", and I know it should be easier because I don't want to match respective rows.

I have tried:

df1 %>%
  mutate(no_coverage_in = case_when(df2$Start <= Start  & df2$End >=Start & Chr == df2$Chr ~ df2$sample_id ))

But it complains

longer object length is not a multiple of shorter object length
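
That message is R's recycling warning: df2$Start has 3 elements while Start from df1 has 5, so the comparison pairs values by position instead of checking each df1 row against every df2 row. A minimal sketch with plain vectors standing in for the two columns (the names a, b, msg and ok are only illustrative):

```r
a <- c(15, 120, 210, 210, 450)  # stands in for df1$Start (length 5)
b <- c(10, 100, 200)            # stands in for df2$Start (length 3)

# Capture the warning text instead of letting R print it
msg <- tryCatch(b <= a, warning = function(w) conditionMessage(w))
print(msg)

# Comparing one df1 value at a time against all of b avoids the length mismatch
ok <- sapply(a, function(s) any(b <= s))
print(ok)
```

The second form is the shape the answers rely on: one value from df1 compared against a whole column of df2.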

Upvotes: 1

Answers (4)

Freddie J. Heather

Reputation: 192

I believe this gives you your desired result:


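library(dplyr)
df1 %>%
  # join on the shared columns Chr and Gene; the range bounds below are inclusive
  left_join(df2 %>% rename_at(vars(Start, End, sample_id), paste0, "_2"),
            by = c("Chr", "Gene")) %>%
  mutate(sample_id_new = case_when(Start >= Start_2 & Start <= End_2 ~ sample_id_2)) %>%
  select(Chr, Start, End, Gene, sample_id, sample_id_new)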

Output:

  Chr Start End  Gene sample_id sample_id_new
1   1    15  15 gene1       ss6           ss1
2   1   120 130 gene2       ss7           ss1
3   2   210 210 gene3       ss9          <NA>
4   3   210 210 gene3       ss9           ss1
5   4   450 450 gene3      ss10          <NA>

Upvotes: 1

jay.sf

Reputation: 73802

You could write a small FUNction that does the checks for each row of df1 and put it in an lapply that loops over its rows.

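FUN <- \(x, y) {
  # is Start of row x of df1 inside a Start-End range of y (bounds inclusive)?
  rng <- df1[x, "Start"] >= y[, "Start"] & df1[x, "Start"] <= y[, "End"]
  chr <- df1[x, "Chr"] == y[, "Chr"]
  hit <- which(rng & chr)
  # return sample_id of the first row of y matching both range and Chr, else NA
  if (length(hit)) y[hit[1], "sample_id"] else NA
}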

df1 <- transform(df1, match=unlist(lapply(seq.int(nrow(df1)), FUN, df2)))
df1
#   Chr Start End  Gene sample_id match
# 1   1    15  15 gene1       ss6   ss1
# 2   1   120 130 gene2       ss7   ss1
# 3   2   210 210 gene3       ss9  <NA>
# 4   3   210 210 gene3       ss9   ss1
# 5   4   450 450 gene3      ss10  <NA>

Note:

I used the shorthand notation for creating functions that was introduced in R 4.1.0. For older R versions, use FUN <- function(x, y) instead of FUN <- \(x, y), or update R.
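
As an alternative to looping over rows, the same check can be sketched as a join on Chr followed by a range filter. This is only an illustration in base R, not the method used above; the helper names query, cand and hits and the column names row and pos are assumptions made for the example, and only the first matching df2 row is kept per df1 row.

```r
df1 <- data.frame(Chr = c(1, 1, 2, 3, 4),
                  Start = c(15, 120, 210, 210, 450),
                  End = c(15, 130, 210, 210, 450),
                  Gene = c("gene1", "gene2", "gene3", "gene3", "gene3"),
                  sample_id = c("ss6", "ss7", "ss9", "ss9", "ss10"))
df2 <- data.frame(Chr = c(1, 1, 3),
                  Start = c(10, 100, 200),
                  End = c(50, 200, 250),
                  Gene = c("gene1", "gene2", "gene3"),
                  sample_id = c("ss1", "ss1", "ss1"))

# One row per df1 position, tagged with its row number (hypothetical names)
query <- data.frame(row = seq_len(nrow(df1)), Chr = df1$Chr, pos = df1$Start)

# All df1/df2 pairs that share a chromosome
cand <- merge(query, df2, by = "Chr")

# Keep pairs where the df1 position lies inside the df2 range (inclusive)
hits <- cand[cand$pos >= cand$Start & cand$pos <= cand$End, ]

# Map the first hit per df1 row back onto df1; rows without a hit get NA
df1$match <- as.character(hits$sample_id)[match(query$row, hits$row)]
print(df1)
```

The join builds every same-chromosome pair first, so it trades memory for avoiding an explicit loop.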

Upvotes: 1

SBMVNO

Reputation: 642

Here is a suggestion.

df1$match <- sapply(seq_len(nrow(df1)), function(x)
  any(df1[x, "Chr"] == df2[, "Chr"] &
      df1[x, "Start"] <= df2[, "End"] &
      df1[x, "Start"] >= df2[, "Start"]))
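
The check above yields TRUE/FALSE. If the matching df2$sample_id is wanted instead, one possible variation is sketched below. The function name match_id and the column name match_id are hypothetical, and joining several matching ids with a comma is an assumption about how multiple hits should be reported.

```r
df1 <- data.frame(Chr = c(1, 1, 2, 3, 4),
                  Start = c(15, 120, 210, 210, 450),
                  End = c(15, 130, 210, 210, 450),
                  Gene = c("gene1", "gene2", "gene3", "gene3", "gene3"),
                  sample_id = c("ss6", "ss7", "ss9", "ss9", "ss10"))
df2 <- data.frame(Chr = c(1, 1, 3),
                  Start = c(10, 100, 200),
                  End = c(50, 200, 250),
                  Gene = c("gene1", "gene2", "gene3"),
                  sample_id = c("ss1", "ss1", "ss1"))

# Return the sample_id(s) of rows in ref whose range contains start on chr
match_id <- function(chr, start, ref) {
  hit <- ref$Chr == chr & ref$Start <= start & ref$End >= start
  if (any(hit)) {
    paste(unique(ref$sample_id[hit]), collapse = ",")
  } else {
    NA_character_
  }
}

# Apply the check to each (Chr, Start) pair of df1
df1$match_id <- mapply(match_id, df1$Chr, df1$Start,
                       MoreArgs = list(ref = df2))
print(df1)
```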

Upvotes: 1

Daman deep

Reputation: 631

Is this what you desire?

Given data frames
> df1
  Chr Start End  Gene sample_id
1   1    15  15 gene1       ss6
2   1   120 130 gene2       ss7
3   2   210 210 gene3       ss9
4   3   210 210 gene3       ss9
5   4   450 450 gene3      ss10
> df2
  Chr Start End  Gene sample_id
1   1    10  50 gene1       ss1
2   1   100 200 gene2       ss1
3   3   200 250 gene3       ss1

vec2 <- character(nrow(df1))
for (k in seq_len(nrow(df1))) {
  # rows of df2 on the same chromosome as row k of df1 (may be none)
  vec <- which(df2$Chr == df1$Chr[k])
  # is Start of row k inside any of those Start-End ranges (bounds inclusive)?
  inside <- df1$Start[k] >= df2$Start[vec] & df1$Start[k] <= df2$End[vec]
  vec2[k] <- if (any(inside)) "Yes" else "No"
}
df1$Results <- vec2

Output:

> df1
  Chr Start End  Gene sample_id Results
1   1    15  15 gene1       ss6     Yes
2   1   120 130 gene2       ss7     Yes
3   2   210 210 gene3       ss9      No
4   3   210 210 gene3       ss9     Yes
5   4   450 450 gene3      ss10      No

Upvotes: 1
