Reputation: 11
I have a dataset with many columns but I will only mention the ones that are necessary for this operation and provide a provisional one (I also believe you do not need the ID information I just included it to make it easier to comprehend).
Business Division | Local Claim ID| CMB
GC 123 **Y**
GC 124 N
NAC 125 N
NAC 126 N
NAC 127 **Y**
GC 128 N
and I would like to get rid of the CMB column while I would replace the CMB value for each business division if the original value is Y, basically I would want the table to look like following: (Business Divison now 3 classes)
Business Division | Local Claim ID
**CMB** 123
GC 124
NAC 125
NAC 126
**CMB** 127
GC 128
Here is the output of dput
to reproduce my data:
structure(list(Business.Division = c("CMB", "GC", "NAC", "NAC",
"CMB", "GC"), Local.Claim.ID = 123:128, CMB = c("Y", "N",
"N", "N", "Y", "N")), .Names = c("Business.Division", "Local.Claim.ID",
"CMB"), row.names = c(NA, -6L), class = "data.frame")
Upvotes: 1
Views: 185
Reputation: 2541
You can use ifelse
alone (without any additional packages) for this purpose:
df$Business.Division = ifelse(df$CMB == "Y", "CMB", df$Business.Division)
As DavidArenburg points out in the comments below, this is not efficient for working with big data. But if your data isn't huge, this is a nice, simple way to go.
Upvotes: 2
Reputation: 2190
Or (only works if businessdivision
isn't a factor
)
df$businessdivision[which(df$CMB == "Y")] = "CMB"
Upvotes: 1
Reputation: 21621
Try:
df %>%
mutate(Business.Division = replace(Business.Division, which(CMB == 'Y'), 'CMB')) %>%
select(-CMB)
Which gives:
# Business.Division Local.Claim.ID
#1 CMB 123
#2 GC 124
#3 NAC 125
#4 NAC 126
#5 CMB 127
#6 GC 128
Benchmark
Updated to add suggestions on the benchmark:
df <- data.frame(Business.Division = sample(c("GC", "NAC"), 10e6, replace = TRUE),
Local.Claim.ID = sample(100:199, 10e6, replace = TRUE),
CMB = sample(c("Y", "N"), 10e6, replace = TRUE),
stringsAsFactors = FALSE)
library(microbenchmark)
mbm <- microbenchmark(
me = mutate(df, Business.Division = replace(Business.Division,which(CMB == "Y"), "CMB")),
stevensp = (df$Business.Division <- ifelse(df$CMB == "Y", "CMB", df$Business.Division)),
mts = (df$Business.Division[which(df$CMB == "Y")] = "CMB"),
david1 = setDT(df)[CMB == "Y", Business.Division := "CMB"],
david2 = setkey(setDT(df), CMB)[.("Y"), Business.Division := "CMB"],
times = 10
)
David's is much faster:
> mbm
Unit: milliseconds
expr min lq mean median uq max neval cld
me 496.79251 556.70752 592.35165 608.23875 634.88809 661.33805 10 b
stevensp 3449.53516 3518.47649 3585.91006 3572.62433 3681.19332 3718.06284 10 c
mts 591.22479 654.01000 661.02210 661.41281 679.53060 719.74752 10 b
david1 58.67554 62.15468 66.85337 62.31426 62.99337 92.49148 10 a
david2 86.04280 89.42500 117.76540 89.61656 89.79652 232.45398 10 a
Upvotes: 3
Reputation: 92292
I would go with data.table
if you want to both evaluate only the relevant rows and update in place
library(data.table)
setDT(df)[CMB == "Y", Business.Division := "CMB"][, CMB := NULL]
# Business.Division Local.Claim.ID
# 1: CMB 123
# 2: GC 124
# 3: NAC 125
# 4: NAC 126
# 5: CMB 127
# 6: GC 128
Upvotes: 4