Reputation: 391
Hi all,
I have a large data set (over 2 million rows), and in one of the columns I have the following levels:
"0" "0.001" "1" "4" "4.001" "8.001"
I want to make a new column where each of those has a new, corresponding letter:
0 = x, 0.001 = D, 1 = C, 4 and 4.001 = B, and 8.001 = A
Is there a way to do this without using a for loop with six if statements? I tried that, and it took forever to run.
Here's a test sample:
a b
1 0.000 x
2 4.000 B
3 1.000 C
4 0.001 D
5 1.000 C
6 4.000 B
7 4.001 B
8 1.000 C
9 8.001 A
Thank you.
Upvotes: 0
Views: 2246
Reputation: 109
I would try this, though I'm not sure about the runtime:
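library(forcats)
df <- data.frame(a = c("0", "0.001", "1", "4", "4.001", "8.001"))
# fct_recode() takes new_level = "old_level" pairs and expects a factor or
# character vector; two old levels may map to the same new level
df$b <- fct_recode(df$a,
                   x = "0",
                   D = "0.001",
                   C = "1",
                   B = "4",
                   B = "4.001",
                   A = "8.001")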
Upvotes: 0
Reputation: 887138
The easiest way would be to create a key/value dataset and join it with the original data:
keyval <- data.frame(a = c(0, 0.001, 1, 4, 4.001, 8.001),
b = c('x', 'D', 'C', 'B', 'B', 'A'), stringsAsFactors= FALSE)
library(data.table)
setDT(df1)[keyval, b := b, on = .(a)]
df1
# a b
#1: 0.000 x
#2: 4.000 B
#3: 1.000 C
#4: 0.001 D
#5: 1.000 C
#6: 4.000 B
#7: 4.001 B
#8: 1.000 C
#9: 8.001 A
# Sample data; define df1 before running the join above
df1 <- structure(list(a = c(0, 4, 1, 0.001, 1, 4, 4.001, 1, 8.001)),
                 .Names = "a", row.names = c("1",
                 "2", "3", "4", "5", "6", "7", "8", "9"), class = "data.frame")
Upvotes: 2
Reputation: 1256
I do not believe there is a single-line command that can do this for you. By the way, for
loops in R that test each row with if statements are slow compared with vectorized operations, and are not recommended for large data sets.
Option 1:
What you may want to try is logical indexing, which selects rows with a vector of TRUE/FALSE values (much like a bit array).
df$NewColumn <- NA_character_   # create the new column first
idx <- df$a == 0                # compare against the value itself, not "0.000"
df$NewColumn[idx] <- "x"
idx <- df$a == 4
df$NewColumn[idx] <- "B"
and so on and so forth...
Option 2:
Use plyr and revalue, which is a simpler implementation; however, it could be more compute-intensive than option 1. It should still easily work for your data size.
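library(plyr)
# revalue() works on factor or character vectors, and the old values must be quoted names
df$NewColumn <- revalue(as.character(df$a), c("0" = "x", "0.001" = "D", "1" = "C", "4" = "B", "4.001" = "B", "8.001" = "A"))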
For either option, make sure that the data type (class) is handled correctly. From your example, it's hard for me to tell whether the data is factor or numeric, but either way, it's a simple change to make in my sample code.
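As a rough illustration of the data type point, the class of a column can be checked and converted as below. The vector here is a made-up stand-in, not the actual data. Note that calling as.numeric() directly on a factor returns the internal integer codes, so the conversion goes through as.character() first.

```r
# Made-up stand-in for a column that was read in as a factor
a_factor <- factor(c("0", "4", "1", "0.001"))

class(a_factor)   # check how the column is stored

# Factor -> numeric: convert to character first, then to numeric
a_numeric <- as.numeric(as.character(a_factor))

# Numeric -> character, e.g. before passing the column to revalue()
a_char <- as.character(a_numeric)
```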
Upvotes: 1
Reputation: 155
Try factor(x, levels = c(...), labels = c(...)), listing the existing levels and their new labels separated by commas. (as.factor itself does not take a levels argument.)
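A minimal sketch of that idea, using the values from the question as a stand-in vector. It assumes R 3.4.0 or later, where factor() accepts duplicated labels and merges the corresponding levels (here 4 and 4.001 both become "B").

```r
# Stand-in vector with the values from the question's test sample
x <- c(0, 4, 1, 0.001, 1, 4, 4.001, 1, 8.001)

# levels lists the existing values, labels the letter for each, in the same order
b <- factor(x,
            levels = c(0, 0.001, 1, 4, 4.001, 8.001),
            labels = c("x", "D", "C", "B", "B", "A"))

# Convert to plain character if a factor column is not wanted
b_chr <- as.character(b)
```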
Upvotes: 0