Compare 1 row of data to the rest of the data and track differences

Question

I'm hoping to find a nice way to compare a lot of data with 1 specific row of data. I don't need to compare all rows with all other rows, just the 1 specific row. I have a large dataset (1040 observations of 10808 variables) of genetic data with one base pair per column. I've attached images of my dataset since I think that's the easier way to show what I am working with:

Sequence names are on the left, and many columns are empty, but eventually there is genetic data:

I need to compare each position of each sequence to the sequence in the first position (L19088.1). This is the "canonical" sequence and I am interested in sequences with similarities and differences at the individual base pair level. If the value is the same as the canonical, nothing needs to change (the value can just stay as is). If the value is different than the canonical, I'd like to paste an "X" in that location. At the end of this, I plan to add up the X's in each column and continue the analysis from there. I'd like to keep this all in R.

In words, I want to do something like:

for each position in df { if value == canonical value, copy value. if not, paste "X" }

Attempted code:

for (i in seq_along(df){
  ifelse(i == df$Sequence[1], paste[i], paste["X"])
}

But of course this isn't working. I'm not sure how to specify the position of the canonical sequence in a loop. Any suggestions would be greatly appreciated!!

Caspar V. · Accepted Answer

A simple way to go about this is to iterate over columns. With canon as the single reference value for a column, and col the name of that column, we can write "X" to all values that differ from it like this:

df[,col][ df[,col] != canon  ] <- 'X'

Since the canonical value is in uppercase, while the rest is lowercase, we have to make canon lowercase first, and we need to exclude the first line when placing our X-es:

canon = tolower(canon)

df[-1,col][ df[-1,col] != canon  ] <- 'X'
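
As a minimal sketch of that single-column step, here it is on a tiny made-up data frame. The names toy and col are placeholders for illustration and are not taken from the question's data:

```r
# toy data: the first row holds the canonical (uppercase) sequence
toy <- data.frame(Sequence = c('L19088.1', 'chr1_1', 'chr1_2'),
                  b = c('G', 'g', 'a'))
col <- 'b'

# lowercase canonical value for this column
canon <- tolower(toy[1, col])

# among all rows except the first, overwrite values that differ from canon
toy[-1, col][ toy[-1, col] != canon ] <- 'X'

toy[, col]
# [1] "G" "g" "X"
```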

Full code

Working full code including a loop over all requested columns:

# create miniature example dataset
df <- data.frame(Sequence = c('L19088.1','chr1_1','chr1_2',
                              'chr1_3','chr1_4'),
                a = c('-','-','-','c','-'),
                b = c('G','g','a','g','g'),
                c = c('C','c','t','a','-') )

#   Sequence a b c
# 1 L19088.1 - G C
# 2   chr1_1 - g c
# 3   chr1_2 - a t
# 4   chr1_3 c g a
# 5   chr1_4 - g -

# choose columns to operate on
columns <- colnames(df)[-1] # all columns minus first one

# [1] "a" "b" "c"

# iterate over chosen columns
for(col in columns) {
  
  # optional: view status of column before edit
  cat('old:', df[,col], '\n')
  
  # get lowercase of canon value
  canon = tolower(df[1,col])
  
  # select df minus first row, column `col`, create a selection
  # of that where the value is *not* equal to the canon value,
  # and write 'X' to that selection
  df[-1,col][ df[-1,col] != canon  ] <- 'X'
  
  # optional: view status of column after edit
  cat('new:', df[,col], '\n\n')
  
}

# old: - - - c - 
# new: - - - X - 
#
# old: G g a g g 
# new: G g X g g 
#
# old: C c t a - 
# new: C c X X X
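
With as many columns as in the question (10808), the same marking can probably also be done without an explicit loop by working on a character matrix. This is only a sketch of an alternative, not part of the original answer. It assumes all data columns are character (the default in R 4.0 and later), and df2, m, canon and canon_m are names chosen for illustration:

```r
# same miniature dataset as above, under a different name
df2 <- data.frame(Sequence = c('L19088.1','chr1_1','chr1_2',
                               'chr1_3','chr1_4'),
                  a = c('-','-','-','c','-'),
                  b = c('G','g','a','g','g'),
                  c = c('C','c','t','a','-') )

# all data columns (no Sequence), all rows except the canonical one
m <- as.matrix(df2[-1, -1])

# lowercase canonical value for every column
canon <- tolower(unlist(df2[1, -1]))

# repeat the canonical row so it lines up with every sequence row
canon_m <- matrix(canon, nrow = nrow(m), ncol = ncol(m), byrow = TRUE)

# mark mismatches and write the block back into the data frame
m[m != canon_m] <- 'X'
df2[-1, -1] <- m
```

On the miniature dataset this gives the same data frame as the loop.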

Resulting df:

> df
  Sequence a b c
1 L19088.1 - G C
2   chr1_1 - g c
3   chr1_2 - X X
4   chr1_3 X g X
5   chr1_4 - g X
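
The question mentions adding up the X's in each column afterwards. One possible way to do that is sketched below. The result is recreated under the hypothetical name res so the snippet runs on its own, and x_counts is likewise just an illustrative name:

```r
# the resulting data frame from above, recreated by hand
res <- data.frame(Sequence = c('L19088.1','chr1_1','chr1_2',
                               'chr1_3','chr1_4'),
                  a = c('-','-','-','X','-'),
                  b = c('G','g','X','g','g'),
                  c = c('C','c','X','X','X') )

# compare every data cell to 'X' (skipping the canonical row and the
# Sequence column) and sum the TRUE values per column
x_counts <- colSums(res[-1, -1] == 'X')

x_counts
# a b c 
# 1 1 3 
```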

Sidenotes

There are a few particularities in your dataframe to keep in mind:

missing values appear to be "-" instead of NA
columns have numerical titles, which are not syntactically valid names in R and won't always work
the canonical has capitalized letters, in contrast with the rest
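
A small sketch of how the last two points could be handled, assuming a made-up data frame num with numeric column titles (the real column names in the question's data are not known here):

```r
# hypothetical miniature data; check.names = FALSE keeps the numeric titles
num <- data.frame(Sequence = c('L19088.1', 'chr1_1'),
                  `1` = c('G', 'g'),
                  `2` = c('C', 't'),
                  check.names = FALSE)

# numeric titles need quotes or backticks; num$1 would be a syntax error
first_col <- num[, '1']
same_col  <- num$`1`

# make.names() converts the titles to syntactically valid names
names(num) <- make.names(names(num))
names(num)
# [1] "Sequence" "X1"       "X2"

# bring all data columns to one case so no tolower() is needed later
num[-1] <- lapply(num[-1], tolower)
```

Whether "-" should stay as a literal gap character or be converted to NA depends on the later analysis, so it is left untouched here.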
