Reputation: 1560
I have two dataframes, one that looks like this:
>df1
SNP Symbols
1 rs11807834 GRIN1,SETD1A
2 rs3729986 MADD,STAC3,SPI1
3 rs61937595 NDUFA4L2,STAC3,CAMK2N1
and another that looks like this
>df2
Symbol Score
1 GRIN1 167
2 SETD1A 160
3 MADD 164
4 STAC3 12
5 CAMK2N1 3
6 NDUFA4L2 0
7 SPI1 0
I want to get the Symbol with the highest score for each SNP
column, and then find the "proportion" that score has. So it would look like this:
>result
SNP Symbols Highest.Score Score.Proportion
rs11807834 GRIN1,SETD1A GRIN1 0.51
rs2600490 MADD,STAC3,SPI1 MADD 0.91
rs3729986 NDUFA4L2,STAC3,CAMK2N1 STAC3 0.8
Any suggestions how to achieve this?
df1 <- data.frame("SNP" = c("rs11807834", "rs3729986", "rs61937595" ), "Symbols" = c("GRIN1,SETD1A", "MADD,STAC3,SPI1", "NDUFA4L2,STAC3,CAMK2N1"))
df2 <- data.frame("Symbol" = c("GRIN1", "SETD1A", "MADD", "STAC3", "CAMK2N1", "NDUFA4L2", "SPI1"), "Score" = c(167, 160, 164,12,3,0,0))
Upvotes: 0
Views: 16
Reputation: 56004
Separate genes to rows, group by snp, get proportion, then get rows with maximum scores:
library(dplyr)
library(tidyr)
df1 %>%
separate_rows(Symbols, sep = ",") %>%
left_join(df2, by = c("Symbols" = "Symbol")) %>%
group_by(SNP) %>%
mutate(Score.Proportion = Score/sum(Score)) %>%
slice_max(Score) %>%
rename(Highest.Score = Score) %>%
ungroup()
# # A tibble: 3 x 4
# SNP Symbols Highest.Score Score.Proportion
# <chr> <chr> <dbl> <dbl>
# 1 rs11807834 GRIN1 167 0.511
# 2 rs3729986 MADD 164 0.932
# 3 rs61937595 STAC3 12 0.8
Upvotes: 1