I'm hoping to find a nice way to compare a lot of data with 1 specific row of data. I don't need to compare all rows with all other rows, just the 1 specific row. I have a large dataset (1040 observations of 10808 variables) of genetic data with one base pair per column. I've attached images of my dataset since I think that's the easier way to show what I am working with:
Sequence names are on the left, and many columns are empty, but eventually there is genetic data:
I need to compare each position of each sequence to the sequence in the first position (L19088.1). This is the "canonical" sequence and I am interested in sequences with similarities and differences at the individual base pair level. If the value is the same as the canonical, nothing needs to change (the value can just stay as is). If the value is different than the canonical, I'd like to paste an "X" in that location. At the end of this, I plan to add up the X's in each column and continue the analysis from there. I'd like to keep this all in R.
In words, I want to do something like:
for each position in df { if value == canonical value, copy value. if not, paste "X" }
My attempt so far:
for (i in seq_along(df){
ifelse(i == df$Sequence[1], paste[i], paste["X"])
}
But of course this isn't working. I'm not sure how to specify the position of the canonical sequence in a loop. Any suggestions would be greatly appreciated!!
A simple way to go about this is to iterate over columns. With `canon` as a single reference value, and `col` the name of the column, we can write "X" to all deviant values like so:
df[,col][ df[,col] != canon ] <- 'X'
Since the canonical value is in uppercase, while the rest is lowercase, we have to make `canon` lowercase first, and we need to exclude the first row when placing our X-es:
canon = tolower(canon)
df[-1,col][ df[-1,col] != canon ] <- 'X'
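To see what that one-liner does in isolation, here is a minimal sketch on a single plain vector. The name `col_values` and its contents are made up for illustration (they mirror one column of the example further down); only base R is used.

```r
# Hypothetical single column: the first element is the uppercase canonical base
col_values <- c("G", "g", "a", "g", "g")

# Lowercase the canonical value so it compares equal to the other rows
canon <- tolower(col_values[1])

# Logical mask over elements 2..n: TRUE where the base differs from canon
differs <- col_values[-1] != canon

# Overwrite only the differing entries; the first element is never touched
col_values[-1][differs] <- "X"

print(col_values)
```

The same pattern applies to a data frame column: `df[-1, col]` plays the role of `col_values[-1]`.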
Full code
Working full code including a loop over all requested columns:
# create miniature example dataset
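# stringsAsFactors = FALSE keeps the columns as character in R < 4.0,
# so that 'X' can be assigned without turning into NA
df <- data.frame(Sequence = c('L19088.1','chr1_1','chr1_2',
                              'chr1_3','chr1_4'),
                 a = c('-','-','-','c','-'),
                 b = c('G','g','a','g','g'),
                 c = c('C','c','t','a','-'),
                 stringsAsFactors = FALSE)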
#   Sequence a b c
# 1 L19088.1 - G C
# 2   chr1_1 - g c
# 3   chr1_2 - a t
# 4   chr1_3 c g a
# 5   chr1_4 - g -
# choose columns to operate on
columns <- colnames(df)[-1] # all columns minus first one
# [1] "a" "b" "c"
# iterate over chosen columns
for(col in columns) {
  # optional: view status of column before edit
  cat('old:', df[,col], '\n')

  # get lowercase of canon value
  canon = tolower(df[1,col])

  # select df minus first row, column `col`, create a selection
  # of that where the value is *not* equal to the canon value,
  # and write 'X' to that selection
  df[-1,col][ df[-1,col] != canon ] <- 'X'

  # optional: view status of column after edit
  cat('new:', df[,col], '\n\n')
}
# old: - - - c -
# new: - - - X -
#
# old: G g a g g
# new: G g X g g
#
# old: C c t a -
# new: C c X X X
Resulting `df`:
> df
  Sequence a b c
1 L19088.1 - G C
2   chr1_1 - g c
3   chr1_2 - X X
4   chr1_3 X g X
5   chr1_4 - g X
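Since the question mentions adding up the X's per column afterwards, here is a possible loop-free variant that also produces those counts. It is only a sketch under a few assumptions: the first column holds the sequence names, the first row is the canonical sequence, and every other column is a character column. The names `bases`, `rest`, `canon_mat`, `mismatch` and `x_counts` are made up for this example.

```r
# Rebuild the miniature example so this block runs on its own
df <- data.frame(Sequence = c("L19088.1", "chr1_1", "chr1_2", "chr1_3", "chr1_4"),
                 a = c("-", "-", "-", "c", "-"),
                 b = c("G", "g", "a", "g", "g"),
                 c = c("C", "c", "t", "a", "-"),
                 stringsAsFactors = FALSE)

# Character matrix of the base columns (everything except Sequence)
bases <- as.matrix(df[, -1])

# Lowercased canonical row, and the remaining rows to compare against it
canon <- tolower(bases[1, ])
rest  <- bases[-1, , drop = FALSE]

# Repeat the canonical row so it lines up cell by cell with `rest`
canon_mat <- matrix(canon, nrow = nrow(rest), ncol = ncol(rest), byrow = TRUE)

# TRUE wherever a base differs from the canonical base in that column
mismatch <- rest != canon_mat

# Mark the mismatches and write the result back below the canonical row
rest[mismatch] <- "X"
df[-1, -1] <- as.data.frame(rest, stringsAsFactors = FALSE)

# Number of X's per column, taken directly from the logical matrix
x_counts <- colSums(mismatch)
print(x_counts)
```

Counting from the logical `mismatch` matrix avoids having to search for "X" strings afterwards, and it would keep working if you decided not to overwrite the original bases at all.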
Sidenotes
There are a few particularities in your dataframe to keep in mind: