Reputation: 33
Let's say I have a data frame in R that contains a column with sample names and three additional columns of different treatments that those samples received. The values in the treatment columns correspond to days that the treatment was given or have NA if they did not receive that treatment. How can I transform each row so that the numeric values are replaced by the order that the treatment was received. In other words, the column with the lowest value/day gets mutated into "first_treatment", the second lowest value into "second_treatment" and so on.
Example Input Data:
df = data.frame(samples = LETTERS[1:5],
Treatment_A = c(NA, "3", NA, "10", "12"),
Treatment_B = c("12", NA, NA, "15", "5"),
Treatment_C = c(NA, NA, "5", "8", NA))
Desired Output:
df_output = data.frame(samples = LETTERS[1:5],
Treatment_A = c(NA, "First", NA, "Second","Second"),
Treatment_B = c("First", NA, NA, "Third","First"),
Treatment_C = c(NA, NA, "First", "First", NA))
*edit: fixed what the output should look like.
Upvotes: 1
Views: 91
Reputation: 19191
An approach using cut
for grouping
library(dplyr)
library(tidyr)
df %>%
pivot_longer(-samples) %>%
mutate(value = cut(as.numeric(value),
breaks = seq(min(as.numeric(value), na.rm=T),
max(as.numeric(value), na.rm=T),
length.out=4),
include.lowest=T,
labels=c("First", "Second", "Third"))) %>%
pivot_wider()
# A tibble: 5 × 4
samples Treatment_A Treatment_B Treatment_C
<chr> <fct> <fct> <fct>
1 A NA Third NA
2 B First NA NA
3 C NA NA First
4 D Second Third Second
5 E Third First NA
Upvotes: 0
Reputation: 73572
We can create a small assignment vector, calculate rowRanks
and match them with it.
> v <- c(first=1, second=2, third=3)
> m <- matrixStats::rowRanks(as.matrix(type.convert(df[-1], as.is=TRUE)))
> m[] <- names(v)[match(m, v)]
> data.frame(df['samples'], m)
samples Treatment_A Treatment_B Treatment_C
1 A <NA> first <NA>
2 B first <NA> <NA>
3 C <NA> <NA> first
4 D second third first
5 E second first <NA>
Upvotes: 0
Reputation: 7540
A possible option using label_ordinal()
if you don't mind an automated "1st", "2nd" etc. as an alternative to "first", "second" etc.
row_number
can be switched dense_rank
or min_rank
depending on the ties method preference.
library(dplyr)
library(scales)
df |>
mutate(
across(-samples, \(x) row_number(as.numeric(x))),
across(-samples, label_ordinal())
)
#> samples Treatment_A Treatment_B Treatment_C
#> 1 A <NA> 2nd <NA>
#> 2 B 1st <NA> <NA>
#> 3 C <NA> <NA> 1st
#> 4 D 2nd 3rd 2nd
#> 5 E 3rd 1st <NA>
Created on 2024-05-07 with reprex v2.1.0
Upvotes: 0
Reputation: 20494
The basic logic is to rank()
the columns. In this case we want the "keep"
option for the na.last
parameter:
a logical or character string controlling the treatment of
NA
s. IfTRUE
, missing values in the data are put last; ifFALSE
, they are put first; ifNA
, they are removed; if"keep"
they are kept with rankNA
.
As you want to use "First"
rather than 1
etc. we can create an ordered character vector of ranks and then subset this by rank.
# This could be longer...
ranks <- c("First", "Second", "Third", "Fourth", "Fifth")
# Your input is character rather than numeric
df <- type.convert(df, as.is = TRUE)
cols <- grep("^Treatment", names(df), value = TRUE)
df[cols] <- lapply(df[cols], \(x) ranks[rank(x, na.last = "keep")])
df
# samples Treatment_A Treatment_B Treatment_C
# 1 A <NA> Second <NA>
# 2 B First <NA> <NA>
# 3 C <NA> <NA> First
# 4 D Second Third Second
# 5 E Third First <NA>
identical(df, df_output)
# [1] TRUE
tidyverse
approachThe same basic approach can be taken using dplyr
and readr
:
library(dplyr)
df |>
readr::type_convert() |>
mutate(
across(
Treatment_A:Treatment_C,
\(x) ranks[rank(x, na.last = "keep")]
)
)
# samples Treatment_A Treatment_B Treatment_C
# 1 A <NA> Second <NA>
# 2 B First <NA> <NA>
# 3 C <NA> <NA> First
# 4 D Second Third Second
# 5 E Third First <NA>
Upvotes: 1