Reputation: 15
I am working on a data set (30000 x 500 ) where I need to replace some values in columns based on data from another column. The problem is that in each row, the reference values change. Here is an sub-example of the data set:
#Create a data frame
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"),
A_allele = c("C","G","C","G","C","C","A","T","G","C"),
B_allele = c("G","A","T","A","A","G","T","A","C","A"),
alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
line_1 = sample(c("A","B"),10, replace = TRUE),
line_2 = sample(c("A","B"),10, replace = TRUE),
line_3 = sample(c("A","B"),10, replace = TRUE),
line_4 = sample(c("A","B"),10, replace = TRUE),
line_5 = sample(c("A","B"),10, replace = TRUE),
line_6 = sample(c("A","B"),10, replace = TRUE),
line_7 = sample(c("A","B"),10, replace = TRUE),
line_8 = sample(c("A","B"),10, replace = TRUE),
line_9 = sample(c("A","B"),10, replace = TRUE),
line_10 = sample(c("A","B"),10, replace = TRUE)
)
df
head(df)
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G B A B A B B B B B A
2 SNP2 G A G/A A B A A A B B A B A
3 SNP3 C T C/T B B A B B B A A A A
4 SNP4 G A G/A A B B A B A B B B A
5 SNP5 C A C/A B A B B B A B A B B
6 SNP6 C G C/G B A B A B A B B B B
7 SNP7 A T A/T B A A B A A B A B A
8 SNP8 T A T/A A B A B A A B B A B
9 SNP9 G C G/C B A B B B B A B A B
10 SNP10 C A C/A B B B B B A A A A A
For each row, A_allele and B_allele columns serve as reference values to change A or B values in the 10 lines. When there is an "A" value => use the value from column A_allele and when there is a "B" value => use the value from column_B.
In the example, this should be as following:
The output should look something like this:
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G G C G C G G G G G C
2 SNP2 G A G/A G A G G G A A G A G
3 SNP3 C T C/T T T C T T T C C C C
4 SNP4 G A G/A G A A G A G A A A G
5 SNP5 C A C/A A C A A A C A C A A
6 SNP6 C G C/G G C G C G C G G G G
7 SNP7 A T A/T T A A T A A T A T A
8 SNP8 T A T/A T A T A T T A A T A
9 SNP9 G C G/C C G C C C C G C G C
10 SNP10 C A C/A A A A A A C C C C C
As there are ~30000 rows, I would like an efficient code to run if it possible.
Any suggestions?
Upvotes: 0
Views: 37
Reputation: 26218
You can do
library(tidyverse)
df %>% mutate(across(starts_with("line"), ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
#output with df generated with set.seed(2021)
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 line_8 line_9 line_10
1 SNP1 C G C/G C C G C C C G G C G
2 SNP2 G A G/A A A A A G G G G G G
3 SNP3 C T C/T T T C C T T T T T C
4 SNP4 G A G/A A G A A A G G A G A
5 SNP5 C A C/A C C C A C A A C C A
6 SNP6 C G C/G G C C C C C G C G G
7 SNP7 A T A/T T A T T T T T A T A
8 SNP8 T A T/A A T A T A A A T A T
9 SNP9 G C G/C C C C C C G G G C C
10 SNP10 C A C/A A C A C A C C C C A
As stated in comments, if column name do not follow a pattern, Option-1 you can store these in a vector say vars and use this inside across
set.seed(2021)
df <- data.frame(SNP = c("SNP1","SNP2","SNP3","SNP4","SNP5","SNP6","SNP7","SNP8","SNP9","SNP10"),
A_allele = c("C","G","C","G","C","C","A","T","G","C"),
B_allele = c("G","A","T","A","A","G","T","A","C","A"),
alleles = c("C/G","G/A","C/T","G/A","C/A","C/G","A/T","T/A","G/C","C/A"),
line_1 = sample(c("A","B"),10, replace = TRUE),
line_2 = sample(c("A","B"),10, replace = TRUE),
line_3 = sample(c("A","B"),10, replace = TRUE),
line_4 = sample(c("A","B"),10, replace = TRUE),
line_5 = sample(c("A","B"),10, replace = TRUE),
line_6 = sample(c("A","B"),10, replace = TRUE),
line_7 = sample(c("A","B"),10, replace = TRUE),
cat = sample(c("A","B"),10, replace = TRUE),
dog = sample(c("A","B"),10, replace = TRUE),
rabbit = sample(c("A","B"),10, replace = TRUE)
)
vars <- c("line_1", "line_2", "line_3", "line_4", "line_5", "line_6", "line_7", "cat", "dog", "rabbit")
df %>% mutate(across(.cols = vars, ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
Note: Using an external vector in selections is ambiguous.
i Use `all_of(vars)` instead of `vars` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
1 SNP1 C G C/G C C G C C C G G C G
2 SNP2 G A G/A A A A A G G G G G G
3 SNP3 C T C/T T T C C T T T T T C
4 SNP4 G A G/A A G A A A G G A G A
5 SNP5 C A C/A C C C A C A A C C A
6 SNP6 C G C/G G C C C C C G C G G
7 SNP7 A T A/T T A T T T T T A T A
8 SNP8 T A T/A A T A T A A A T A T
9 SNP9 G C G/C C C C C C G G G C C
10 SNP10 C A C/A A C A C A C C C C A
Option-2 you may also column indexes directly
df %>% mutate(across(5:14, ~ifelse(. == "A", str_sub(alleles, 1, 1), str_sub(alleles, 3, 3))))
SNP A_allele B_allele alleles line_1 line_2 line_3 line_4 line_5 line_6 line_7 cat dog rabbit
1 SNP1 C G C/G C C G C C C G G C G
2 SNP2 G A G/A A A A A G G G G G G
3 SNP3 C T C/T T T C C T T T T T C
4 SNP4 G A G/A A G A A A G G A G A
5 SNP5 C A C/A C C C A C A A C C A
6 SNP6 C G C/G G C C C C C G C G G
7 SNP7 A T A/T T A T T T T T A T A
8 SNP8 T A T/A A T A T A A A T A T
9 SNP9 G C G/C C C C C C G G G C C
10 SNP10 C A C/A A C A C A C C C C A
Upvotes: 2
Reputation: 388807
You can use across
in dplyr
along with ifelse
.
library(dplyr)
df %>% mutate(across(starts_with('line'), ~ifelse(. == 'A', A_allele, B_allele)))
Or lapply
in base R :
cols <- grep('line', names(df))
df[cols] <- lapply(df[cols], function(x) ifelse(x == 'A', df$A_allele, df$B_allele))
Upvotes: 0