Reputation: 51

Defining column value based on a set of columns

suppose I have the following data:

data <- data.frame(Opt_1 = c("AA", "GG", "CC", "GG"),
                   Opt_2 = c("AT", "GC", "CG", "CG"),
                   Opt_3 = c("TA", "CG", "GC", "GC"),
                   Opt_4 = c("TT", "CC", "GG", "CC"),
                   prob_1 = c(0.025, 0.025, 0.925, 0.025),
                   prob_2 = c(0.025, 0.925, 0.025, 0.025),
                   prob_3 = c(0.925, 0.025, 0.025, 0.025),
                   prob_4 = c(0.025, 0.025, 0.025, 0.925))

  Opt_1 Opt_2 Opt_3 Opt_4 prob_1 prob_2 prob_3 prob_4
     AA    AT    TA    TT  0.025  0.025  0.925  0.025
     GG    GC    CG    CC  0.025  0.925  0.025  0.025
     CC    CG    GC    GG  0.025  0.025  0.025  0.925
     GG    CG    GC    CC  0.925  0.025  0.025  0.025

I would like to define a new column that captures, for each row what is the Opt with maximum prob. Giving an output like the following:


  Opt_1 Opt_2 Opt_3 Opt_4 prob_1 prob_2 prob_3 prob_4 out
     AA    AT    TA    TT  0.025  0.025  0.925  0.025  TA
     GG    GC    CG    CC  0.025  0.925  0.025  0.025  GC
     CC    CG    GC    GG  0.025  0.025  0.025  0.925  GG
     GG    CG    GC    CC  0.925  0.025  0.025  0.025  GG

I wrote a for() loop but it is taking forever as the data base is pretty large.

Upvotes: 1

Answers (4)

Jules Stuifbergen

Reputation: 96

Oneliner in base R:

data$max <- apply(data, 1, function(x) { x[which(x == max(x[5:8]))[1] -4] })

gives (on your data):

> data
  Opt_1 Opt_2 Opt_3 Opt_4 prob_1 prob_2 prob_3 prob_4 max
1    AA    AT    TA    TT  0.025  0.025  0.925  0.025  TA
2    GG    GC    CG    CC  0.025  0.925  0.025  0.925  GC
3    CC    CG    GC    GG  0.025  0.025  0.025  0.025  CC
4    GG    CG    GC    CC  0.925  0.025  0.025  0.025  GG

Upvotes: 2

ThomasIsCoding

Reputation: 101064

Here is a base R solution

data$out <- mapply(function(i,j) data[grep("Opt_",names(data))][i,j],
                   seq(nrow(data)),
                   max.col(data[grep("prob_",names(data))]))

such that

> data
  Opt_1 Opt_2 Opt_3 Opt_4 prob_1 prob_2 prob_3 prob_4 out
1    AA    AT    TA    TT  0.025  0.025  0.925  0.025  TA
2    GG    GC    CG    CC  0.025  0.925  0.025  0.025  GC
3    CC    CG    GC    GG  0.025  0.025  0.025  0.925  GG
4    GG    CG    GC    CC  0.925  0.025  0.025  0.025  GG

Upvotes: 1

d.b

Reputation: 32548

nm = paste0("Opt_", max.col(data[paste0("prob_", 1:4)], ties.method = "first"))
as.character(sapply(seq_along(nm), function(i) data[i, nm[i]]))
#[1] "TA" "GC" "CC" "GG"

Upvotes: 3

r2evans

Reputation: 160407

Here's a tidyverse solution:

library(dplyr)
library(tidyr)

# uniquely identify each row, for joining back in later
dat$rn <- seq_len(nrow(dat))

as_tibble(dat) %>%
  pivot_longer(
    cols = -rn,
    names_to = c(".value", "set"),
    names_pattern = "(.+)_(.)"
  ) %>%
  group_by(rn) %>%
  slice(which.max(prob)) %>%
  select(rn, out = Opt) %>%
  left_join(dat, ., by = "rn")
#   Opt_1 Opt_2 Opt_3 Opt_4 prob_1 prob_2 prob_3 prob_4 rn out
# 1    AA    AT    TA    TT  0.025  0.025  0.925  0.025  1  TA
# 2    GG    GC    CG    CC  0.025  0.925  0.025  0.925  2  GC
# 3    CC    CG    GC    GG  0.025  0.025  0.025  0.025  3  CC
# 4    GG    CG    GC    CC  0.925  0.025  0.025  0.025  4  GG

Upvotes: 1

Defining column value based on a set of columns

Answers (4)

Related Questions