Reputation: 91
I have two dataframes of different sizes:
df1<-data.frame(Chr = c(1, 1,2,3,4),
Start = c(15,120, 210,210,450),
End = c(15,130, 210,210,450),
Gene=c("gene1","gene2","gene3","gene3","gene3"),
sample_id=c("ss6","ss7","ss9","ss9","ss10"))
df2 <- data.frame(Chr = c(1, 1,3),
Start = c(10,100, 200),
End = c(50,200, 250),
Gene=c("gene1","gene2","gene3"),
sample_id=c("ss1","ss1","ss1"))
I would like to take each Start from df1 and check whether it falls within the Start-End range of a row in df2, while also making sure the Chr is the same (the sample_id does not have to match). If it does, I want to add a column to df1, ideally containing df2$sample_id; if that is not possible, then YES (or NA for no match). It is similar to this question, but I also need to match 'Chr': Only checking range
It is also similar to this question, and I know it should be easier because I don't need to match respective rows: Check if column value is in between (range) of two other column values
I have tried:
df1 %>%
mutate(no_coverage_in = case_when(df2$Start <= Start & df2$End >=Start & Chr == df2$Chr ~ df2$sample_id ))
But it complains
longer object length is not a multiple of shorter object length
Upvotes: 1
Views: 1899
Reputation: 192
I believe this gives you your desired result:
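library(dplyr)
df1 %>%
  # join on the shared columns Chr and Gene; df2's other columns get a "_2" suffix
  left_join(df2 %>% rename_at(vars(Start, End, sample_id), paste0, "_2"), by = c("Chr", "Gene")) %>%
  # keep df2's sample_id only where df1's Start lies within df2's range (inclusive)
  mutate(sample_id_new = case_when(Start >= Start_2 & Start <= End_2 ~ sample_id_2)) %>%
  select(Chr, Start, End, Gene, sample_id, sample_id_new)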
Output:
Chr Start End Gene sample_id sample_id_new
1 1 15 15 gene1 ss6 ss1
2 1 120 130 gene2 ss7 ss1
3 2 210 210 gene3 ss9 <NA>
4 3 210 210 gene3 ss9 ss1
5 4 450 450 gene3 ss10 <NA>
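Note that the join above matches on Gene as well as Chr. If the match should depend on Chr alone, as the question describes, one possible base R sketch is to pair the rows by Chr with merge and then filter on the range. The helper column row_id and the name sample_id_new are only illustrative, and if several df2 rows match, this keeps just the first one.

```r
df1 <- data.frame(Chr = c(1, 1, 2, 3, 4),
                  Start = c(15, 120, 210, 210, 450),
                  End = c(15, 130, 210, 210, 450),
                  Gene = c("gene1", "gene2", "gene3", "gene3", "gene3"),
                  sample_id = c("ss6", "ss7", "ss9", "ss9", "ss10"))
df2 <- data.frame(Chr = c(1, 1, 3),
                  Start = c(10, 100, 200),
                  End = c(50, 200, 250),
                  Gene = c("gene1", "gene2", "gene3"),
                  sample_id = c("ss1", "ss1", "ss1"))

# temporary id (illustrative name) so each df1 row can be found again after merging
df1$row_id <- seq_len(nrow(df1))

# pair every df1 row with every df2 row on the same chromosome;
# df2's Start, End and sample_id get a "_2" suffix
pairs <- merge(df1, df2[, c("Chr", "Start", "End", "sample_id")],
               by = "Chr", suffixes = c("", "_2"))

# keep only pairs where df1's Start lies within df2's range (inclusive)
hits <- pairs[pairs$Start >= pairs$Start_2 & pairs$Start <= pairs$End_2, ]

# first hit per df1 row; rows without a hit get NA
df1$sample_id_new <- hits$sample_id_2[match(df1$row_id, hits$row_id)]
df1$row_id <- NULL
```

On the example data this should fill sample_id_new with ss1 for rows 1, 2 and 4 and NA for rows 3 and 5.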
Upvotes: 1
Reputation: 73802
You could write a small FUNction that does the checks for each row of df1 and put it in an lapply that loops over its rows.
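FUN <- \(x, y) {
  # range check: df1 Start (col 2) at or after y's Start, df1 End (col 3) before y's End
  rng <- df1[x, 2] >= y[, 2] & df1[x, 3] < y[, 3]
  # chromosome check (col 1)
  chr <- df1[x, 1] == y[, 1]
  # rows of y that pass both checks
  hit <- which(rng & chr)
  # sample_id (col 5) of the first matching row of y, else NA
  if (length(hit)) y[hit[1], 5] else NA
}
df1 <- transform(df1, match=unlist(lapply(seq.int(nrow(df1)), FUN, df2)))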
df1
# Chr Start End Gene sample_id match
# 1 1 15 15 gene1 ss6 ss1
# 2 1 120 130 gene2 ss7 ss1
# 3 2 210 210 gene3 ss9 <NA>
# 4 3 210 210 gene3 ss9 ss1
# 5 4 450 450 gene3 ss10 <NA>
I used the new shorthand notation for creating functions, available in R >= 4.1.0. For older R versions, instead of FUN <- \(x, y), use FUN <- function(x, y), or update R.
Upvotes: 1
Reputation: 642
Here is a suggestion. It adds a logical column that is TRUE when a row of df1 has a match in df2.
df1$match= sapply( 1:nrow(df1) ,
function(x)
any( df1[x, 'Chr']==df2[, 'Chr'] &
df1[x , 'Start'] <= df2[ , 'End'] &
df1[x , 'Start'] >= df2[ , 'Start'] ))
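The check above gives TRUE/FALSE. If you would rather get df2$sample_id back, as the question asks, a minimal sketch along the same lines could look like this. The helper name match_sample and the column name no_coverage_in are just illustrative choices, and when several df2 rows match, only the first is returned.

```r
df1 <- data.frame(Chr = c(1, 1, 2, 3, 4),
                  Start = c(15, 120, 210, 210, 450),
                  End = c(15, 130, 210, 210, 450),
                  Gene = c("gene1", "gene2", "gene3", "gene3", "gene3"),
                  sample_id = c("ss6", "ss7", "ss9", "ss9", "ss10"))
df2 <- data.frame(Chr = c(1, 1, 3),
                  Start = c(10, 100, 200),
                  End = c(50, 200, 250),
                  Gene = c("gene1", "gene2", "gene3"),
                  sample_id = c("ss1", "ss1", "ss1"))

# hypothetical helper: sample_id of the first row of ref on the same
# chromosome whose [Start, End] range contains start, else NA
match_sample <- function(chr, start, ref) {
  hit <- which(ref$Chr == chr & ref$Start <= start & ref$End >= start)
  if (length(hit) > 0) ref$sample_id[hit[1]] else NA_character_
}

# apply the helper to each (Chr, Start) pair of df1
df1$no_coverage_in <- mapply(match_sample, df1$Chr, df1$Start,
                             MoreArgs = list(ref = df2))
```

Each df1 row is compared against the whole of df2 at once, so the two data frames do not need to have the same number of rows.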
Upvotes: 1
Reputation: 631
Is this what you desire?
Given data frames
> df1
Chr Start End Gene sample_id
1 1 15 15 gene1 ss6
2 1 120 130 gene2 ss7
3 2 210 210 gene3 ss9
4 3 210 210 gene3 ss9
5 4 450 450 gene3 ss10
> df2
Chr Start End Gene sample_id
1 1 10 50 gene1 ss1
2 1 100 200 gene2 ss1
3 3 200 250 gene3 ss1
# start with "No" everywhere and switch to "Yes" when a match is found
vec2 <- rep("No", nrow(df1))
for (k in seq_len(nrow(df1))) {
  # rows of df2 on the same chromosome as row k of df1
  vec <- which(df2$Chr == df1$Chr[k])
  for (m in vec) {
    # df1 Start must lie within the Start-End range of the df2 row
    if (df1$Start[k] >= df2$Start[m] & df1$Start[k] <= df2$End[m]) {
      vec2[k] <- "Yes"
    }
  }
}
df1$Results <- vec2
Output
> df1
Chr Start End Gene sample_id Results
1 1 15 15 gene1 ss6 Yes
2 1 120 130 gene2 ss7 Yes
3 2 210 210 gene3 ss9 No
4 3 210 210 gene3 ss9 Yes
5 4 450 450 gene3 ss10 No
Upvotes: 1