Amaranta_Remedios

Reputation: 773

Replace values in a column of a df all at the same time

I have a simple problem. I have a large data frame, and I need to replace the values in its second column (cluster) according to this mapping:

1 -> 3
2 -> 5
3 -> 1
5 -> 2

> dput(head(df))
structure(list(Target = c("TRINITY_GG_100011_c0_g1_i3.mrna1", 
"TRINITY_GG_100011_c0_g1_i5.mrna1", "TRINITY_GG_100011_c0_g1_i6.mrna1", 
"TRINITY_GG_100011_c0_g1_i9.mrna1", "TRINITY_GG_100016_c0_g1_i1.mrna1", 
"TRINITY_GG_100016_c0_g1_i2.mrna1"), cluster = c(2L, 5L, 5L, 
3L, 4L, 5L), AAA = c(9L, 7L, 8L, 7L, 
5L, 5L)), row.names = c(NA, 6L), class = "data.frame")

# Normally I would do it like this:
df$cluster[df$cluster == 1] <- 3

The problem is that once I change 1 to 3, the later step that changes 3 to 1 will change those same rows back again. So I can't do the replacements sequentially. I need something that looks at the original values and replaces them all at once.
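To make the failure mode concrete, here is a minimal sketch on a plain vector. The vector cluster is illustrative only (the six values from the dput above plus one extra 1, added as an assumption so that every replaced value occurs), and seq_result is a hypothetical name:

```r
# Illustrative vector: the cluster values from head(df) plus one extra 1
cluster <- c(2L, 5L, 5L, 3L, 4L, 5L, 1L)

# Sequential replacement: each step sees the output of the previous one
seq_result <- cluster
seq_result[seq_result == 1] <- 3
seq_result[seq_result == 2] <- 5
seq_result[seq_result == 3] <- 1  # also hits the values that were just changed from 1
seq_result[seq_result == 5] <- 2  # also hits the values that were just changed from 2

seq_result
# The first element (originally 2) ends up as 2 instead of 5,
# and the last element (originally 1) ends up as 1 instead of 3.
```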

Upvotes: 3

Views: 63

Answers (2)

ThomasIsCoding

Reputation: 101044

A base R option using match + ifelse

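p <- c(1, 2, 3, 5)  # original values
q <- c(3, 5, 1, 2)  # replacement for each value in p
transform(
  df,
  # look each value up in p and take the matching entry of q;
  # values that are not in p (e.g. 4) are left unchanged
  cluster = ifelse(cluster %in% p, q[match(cluster, p)], cluster)
)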

gives

                            Target cluster AAA
1 TRINITY_GG_100011_c0_g1_i3.mrna1       5   9
2 TRINITY_GG_100011_c0_g1_i5.mrna1       2   7
3 TRINITY_GG_100011_c0_g1_i6.mrna1       2   8
4 TRINITY_GG_100011_c0_g1_i9.mrna1       1   7
5 TRINITY_GG_100016_c0_g1_i1.mrna1       4   5
6 TRINITY_GG_100016_c0_g1_i2.mrna1       2   5

Upvotes: 1

akrun

Reputation: 886938

We could use a named vector as a lookup table to replace the values

library(dplyr)
df %>%
   mutate(cluster = coalesce(setNames(c(3, 5, 1, 2),
         c(1, 2, 3, 5))[as.character(cluster)], cluster))

-output

#                            Target cluster AAA
#1 TRINITY_GG_100011_c0_g1_i3.mrna1       5   9
#2 TRINITY_GG_100011_c0_g1_i5.mrna1       2   7
#3 TRINITY_GG_100011_c0_g1_i6.mrna1       2   8
#4 TRINITY_GG_100011_c0_g1_i9.mrna1       1   7
#5 TRINITY_GG_100016_c0_g1_i1.mrna1       4   5
#6 TRINITY_GG_100016_c0_g1_i2.mrna1       2   5

One drawback is that indexing the named vector returns NA for values that are not among its names. To keep the original value in those cases, wrap the lookup in coalesce: wherever the lookup returns NA, the corresponding value from the original column is used instead.
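The NA behaviour described above can be seen without dplyr. This is only a base R sketch of the same idea, not the code from the answer: lookup, mapped and result are hypothetical names, and ifelse with is.na stands in for coalesce.

```r
# Named lookup vector: names are the old values, elements are the new ones
lookup <- setNames(c(3L, 5L, 1L, 2L), c(1, 2, 3, 5))

# Cluster values from head(df)
cluster <- c(2L, 5L, 5L, 3L, 4L, 5L)

# Index by name; 4 is not a name in lookup, so that position becomes NA
mapped <- unname(lookup[as.character(cluster)])
mapped
# 5 2 2 1 NA 2

# Fall back to the original value wherever the lookup returned NA
result <- ifelse(is.na(mapped), cluster, mapped)
result
# 5 2 2 1 4 2
```

Because every lookup is made against the untouched cluster vector, the swaps 1 <-> 3 and 2 <-> 5 cannot interfere with each other.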


Or we can do an update join with a key/value dataset

library(data.table)
setDT(df)[data.frame(cluster = c(1, 2, 3, 5), new = c(3, 5, 1, 2)), 
     cluster := new, on = .(cluster)]

Upvotes: 1
