Reputation: 8494
Using the example dataframe:
df <- structure(list(KY27SCH1 = c(4, 4, 4, 4, NA, 5, 2, 4, 4, NA, 4,
5, 3, 5, 5), KY27SCH2 = c(5, 4, 4, 4, NA, 4, 1, 4, 4, NA, 4,
5, 4, 5, 5), KY27SCH3 = c(4, 4, 5, 4, NA, 4, 4, 4, 5, NA, 5,
5, 3, 5, 5), KY27SCH4 = c(3, 5, 5, 4, NA, 5, 4, 5, 5, NA, 5,
5, 4, 5, 5)), .Names = c("KY27SCH1", "KY27SCH2", "KY27SCH3",
"KY27SCH4"), row.names = 197:211, class = "data.frame")
I apply a line of code to add the four different columns together before binding this new column to the original dataframe:
KC27sc_R <- rowSums(df[, c("KY27SCH1", "KY27SCH2", "KY27SCH3", "KY27SCH4")], na.rm = TRUE)
df <- cbind(df, KC27sc_R) # Adds columns to survey dataframe
I then want to recode variable KC27sc_R using the table of results detailed below:
5 = -4.287
6 = -3.040
7 = -2.405
8 = -1.960
9 = -1.605
10 = -1.296
11 = -1.011
12 = -0.735
13 = -0.456
14 = -0.168
15 = 0.134
16 = 0.454
17 = 0.796
18 = 1.166
19 = 1.574
20 = 2.035
21 = 2.582
22 = 3.299
23 = 4.594
i.e. 5 in column KC27sc_R would become -4.287.
Is there a way of recoding a column from a list of numbers without having to go through each number in turn? I normally use the recode function but I am unsure how to do this with a large list.
Any help would be appreciated.
Upvotes: 3
Views: 4852
Reputation: 288
For starters, let's suppose that your table of results is stored in a matrix yo:
yo <- matrix(0, nrow = 19, ncol = 2)
yo[, 1] <- c(5:23)
yo[, 2] <- c( -4.287, -3.040, -2.405, -1.960, -1.605, -1.296, -1.011, -0.735, -0.456, -0.168, 0.134, 0.454, 0.796, 1.166, 1.574, 2.035, 2.582, 3.299, 4.594)
That is, the first column of yo holds the values you want to change, and the second column holds their replacements. Put simply, treat yo as a function: the first column is the function's input x, and the second is its output.
The first thing you need to figure out is which values of KC27sc_R are actually present in yo[, 1] (that is, in which rows you can substitute old values with new ones). This is done like so:
ind <- which( df$KC27sc_R %in% yo[,1] )
ind gives you the rows of all the values of KC27sc_R you can change. The next step is to get all these values:
a <- df[ind,]$KC27sc_R
The next step is to link the values in a with those in yo[, 1], that is, to find the row of yo[, 1] that holds each corresponding value in a. The function match is helpful here:
b <- match( a, yo[,1] )
Like ind, b is an index: for each value in a, it tells you which row of yo[, 2] holds the replacement for that value. The final step is to replace the values in your df:
df[ind, "KC27sc_R"] <- yo[b, 2]
That will do the trick.
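The steps above can be collapsed into a small helper. This is a minimal sketch rather than part of the original answer: the function name recode_lookup is hypothetical, and it assumes, as above, that the first column of the lookup matrix holds the old values and the second the new ones. Values without a match are left unchanged, which mirrors the indexing with ind.

```r
# Hypothetical helper: recode x via a two-column lookup matrix,
# leaving values that have no match as they are.
recode_lookup <- function(x, lookup) {
  b <- match(x, lookup[, 1])       # lookup row for each x, NA if absent
  found <- !is.na(b)               # same role as `ind` above
  x[found] <- lookup[b[found], 2]  # substitute only where a match exists
  x
}

yo <- cbind(5:23, c(-4.287, -3.040, -2.405, -1.960, -1.605, -1.296, -1.011,
                    -0.735, -0.456, -0.168, 0.134, 0.454, 0.796, 1.166,
                    1.574, 2.035, 2.582, 3.299, 4.594))

# 16 and 11 are recoded; 0 and NA have no match and stay untouched
res <- recode_lookup(c(16, 0, 11, NA), yo)
res
```

With the data frame from the question, the call would be df$KC27sc_R <- recode_lookup(df$KC27sc_R, yo).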
Upvotes: 1
Reputation: 67818
If you keep your recoding values in a second data frame, you may try merge:
# original data
df <- data.frame(x = sample(5:7, 10, replace = TRUE))
# recoding data
df2 <- data.frame(x = 5:7, new_x = c(-4.287, -3.040, -2.405))
merge(df, df2)
Edit following @hadley's comment
The solution above tests floating point numbers for equality, which is not a reliable method; see R FAQ 7.31, "Why doesn't R think these numbers are equal?". One way to handle this is to set the columns used for merging to class "integer". (I note that this potential issue is not considered in the last example in ?merge.)
# original data
df <- data.frame(x = as.integer(sample(5:7, 10, replace = TRUE)))
# recoding data
df2 <- data.frame(x = as.integer(5:7), new_x = c(-4.287, -3.040, -2.405))
merge(df, df2)
Edit following @Ananda Mahto's comment - funky sorting with merge and handling of NA
# original data with NA
df <- data.frame(x = as.integer(c(7, NA, 5, 6, NA, 5)))
# recoding data as above, without NA
merge(df, df2, sort = FALSE)
# 'unspecified' order and no match with NA
# can at least handle NA by including NA also in recoding data
df2 <- data.frame(x = as.integer(c(5:7, NA)), new_x = c(-4.287, -3.040, -2.405, NA))
merge(df, df2, sort = FALSE)
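If the original row order matters, one workaround is to carry an explicit row id through the merge and sort on it afterwards. The following is only a sketch of that idea; the column name .id is an assumption chosen for illustration. Setting all.x = TRUE keeps the rows of df that have no match in the recoding data (here the NA rows), so NA does not have to be added to df2.

```r
df  <- data.frame(x = as.integer(c(7, NA, 5, 6, NA, 5)))
df2 <- data.frame(x = 5:7, new_x = c(-4.287, -3.040, -2.405))

df$.id <- seq_len(nrow(df))       # remember the original row order
out <- merge(df, df2, by = "x", all.x = TRUE, sort = FALSE)
out <- out[order(out$.id), ]      # restore that order after merging
out$.id <- NULL                   # drop the helper column
rownames(out) <- NULL
out
```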
A possible merge-oid solution: join in the plyr package
"Unlike merge, join preserves the order of x no matter what join type is used", and there is no need for NA in the recoding data.
library(plyr)
df <- data.frame(x = as.integer(c(7, NA, 5, 6, NA, 5)))
df2 <- data.frame(x = as.integer(c(5:7)), new_x = c(-4.287, -3.040, -2.405))
join(df, df2)
# looks OK
From ?join: "Join is often faster than merge". Whether this is the case here, and whether it is faster than match, I leave for someone else to show.
Upvotes: 3
Reputation: 12326
An answer that requires a minimum of typing and is portable:
# Your original values
origval = seq(5,23)
newval = c(-4.287, -3.04, -2.405, -1.96, -1.605, -1.296, -1.011, -0.735, -0.456, -0.168, 0.134, 0.454, 0.796, 1.166, 1.574, 2.035, 2.582, 3.299, 4.594)
# generate a relationship
sp = smooth.spline(origval,newval)
# look up a value based on your original sequence
pval = predict(sp, origval)
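Now pval$y will contain the predicted (converted) points. Because smooth.spline fits a smoothing curve rather than performing an exact lookup, these values approximate the table entries and are not guaranteed to reproduce them exactly.
You can put any other set of values into the predict function in place of origval, in any order, even ones that weren't in the data series (5.5, etc.).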
Applied to your data set, you can create a placeholder for your variable and then "predict" its values:
df$KY_Rnew = df$KC27sc_R
df$KY_Rnew[!is.na(df$KY_Rnew)] = predict(sp,df$KY_Rnew[!is.na(df$KY_Rnew)])$y
Upvotes: 1
Reputation: 334
Assuming your mapping values are all integers, you could create a vector containing the coded values at the position of the mapping values:
# using mydf defined by Ananda Mahto:
mydf <- structure(list(V1 = c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23), V2 = c(-4.287, -3.04, -2.405,
-1.96, -1.605, -1.296, -1.011, -0.735, -0.456, -0.168, 0.134,
0.454, 0.796, 1.166, 1.574, 2.035, 2.582, 3.299, 4.594)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -19L))
# create vector with index positions corresponding to objective values:
vmap <- rep(NA_real_, max(mydf$V1))
vmap[mydf$V1] <- mydf$V2
vmap
# [1] NA NA NA NA -4.287 -3.040 -2.405 -1.960 -1.605 -1.296
# [11] -1.011 -0.735 -0.456 -0.168 0.134 0.454 0.796 1.166 1.574 2.035
# [21] 2.582 3.299 4.594
# Assign NA to zero values in KC27sc_R (as you cannot have a zero position in an R vector)
# (this could also be another value defined in mydf if you want zero to map to something)
KC27sc_R[KC27sc_R==0] <- NA
# Then, select the values in vmap using the indices defined in KC27sc_R:
Krecode <- vmap[KC27sc_R]
data.frame(KC27sc_R, Krecode)
# KC27sc_R Krecode
# 197 16 0.454
# 198 17 0.796
# 199 18 1.166
# 200 16 0.454
# 201 NA NA
# 202 18 1.166
# ... etc
This should be fairly fast with a long list as all operations are vectorized.
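The positional trick relies on the keys being small positive integers. A closely related variant, sketched here as an assumption rather than as part of the original answer, indexes a named vector by character keys instead, so zeros and other unmapped values simply come back as NA without any pre-cleaning.

```r
mydf <- data.frame(
  V1 = 5:23,
  V2 = c(-4.287, -3.04, -2.405, -1.96, -1.605, -1.296, -1.011, -0.735,
         -0.456, -0.168, 0.134, 0.454, 0.796, 1.166, 1.574, 2.035,
         2.582, 3.299, 4.594)
)

# Names of the vector are the keys "5", "6", ..., "23"
lookup <- setNames(mydf$V2, mydf$V1)

# 0 and NA are not among the names, so they map to NA
x <- c(16, 0, NA, 23)
recoded <- unname(lookup[as.character(x)])
recoded
```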
Upvotes: 1
Reputation: 193677
Assume we have put your lookup table into a data.frame that looks like this:
mydf <- structure(list(V1 = c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23), V2 = c(-4.287, -3.04, -2.405,
-1.96, -1.605, -1.296, -1.011, -0.735, -0.456, -0.168, 0.134,
0.454, 0.796, 1.166, 1.574, 2.035, 2.582, 3.299, 4.594)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -19L))
mydf
# V1 V2
# 1 5 -4.287
# 2 6 -3.040
# 3 7 -2.405
# 4 8 -1.960
# 5 9 -1.605
# 6 10 -1.296
# 7 11 -1.011
# 8 12 -0.735
# 9 13 -0.456
# 10 14 -0.168
# 11 15 0.134
# 12 16 0.454
# 13 17 0.796
# 14 18 1.166
# 15 19 1.574
# 16 20 2.035
# 17 21 2.582
# 18 22 3.299
# 19 23 4.594
You should be able to just use something like the following to get what you are looking for:
df$RECODED <- mydf$V2[match(as.character(df$KC27sc_R), as.character(mydf$V1))]
df
# KY27SCH1 KY27SCH2 KY27SCH3 KY27SCH4 KC27sc_R RECODED
# 197 4 5 4 3 16 0.454
# 198 4 4 4 5 17 0.796
# 199 4 4 5 5 18 1.166
# 200 4 4 4 4 16 0.454
# 201 NA NA NA NA 0 NA
# 202 5 4 4 5 18 1.166
# 203 2 1 4 4 11 -1.011
# 204 4 4 4 5 17 0.796
# 205 4 4 5 5 18 1.166
# 206 NA NA NA NA 0 NA
# 207 4 4 5 5 18 1.166
# 208 5 5 5 5 20 2.035
# 209 3 4 3 4 14 -0.168
# 210 5 5 5 5 20 2.035
# 211 5 5 5 5 20 2.035
The as.character part in there is to help mitigate potential floating-point (FP) problems.
This is conceptually very similar to the merge solution that has been provided, but is likely to be much faster.
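To illustrate the kind of floating-point problem meant here, consider the classic case from R FAQ 7.31. This small sketch is an addition for illustration only: two doubles that print identically can still fail an exact comparison, while their character representations match.

```r
a <- 0.1 + 0.2

eq    <- a == 0.3                                  # exact comparison of doubles
m_num <- match(a, 0.3)                             # match() on the raw doubles
m_chr <- match(as.character(a), as.character(0.3)) # match() on "0.3" vs "0.3"

eq     # FALSE
m_num  # NA
m_chr  # 1
```

Sums of integers such as KC27sc_R are stored exactly, so the conversion is a precaution rather than a necessity for this particular data.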
Benchmarking on an artificial dataset:
set.seed(1)
df <- data.frame(matrix(sample(0:25, 100000, replace = TRUE), ncol = 2))
library(microbenchmark)
microbenchmark(
A = {
df2 <- merge(df, mydf, by.x="X1", by.y="V1", sort = FALSE)
},
B = {
df3 <- cbind(df, recoded = mydf$V2[match(as.character(df$X1),
as.character(mydf$V1))])
}
)
# Unit: milliseconds
# expr min lq median uq max neval
# A 141.32530 149.61354 154.99230 162.7845 239.26242 100
# B 24.93267 25.32541 25.73723 26.0792 96.44209 100
The basic matching approach is well over 5x as fast as merge. Additionally, merge tends to do funky things with your row ordering. Compare the first few rows of the original dataset (df) with those of the merged one (df2) and my solution (df3). As you can see, merge has totally rearranged the data.frame, even though we had specified "sort = FALSE".
head(df)
# X1 X2
# 1 6 15
# 2 9 18
# 3 14 8
# 4 23 3
# 5 5 22
# 6 23 1
head(df2)
# X1 X2 V2
# 1 6 15 -3.04
# 2 6 23 -3.04
# 3 6 3 -3.04
# 4 6 0 -3.04
# 5 6 20 -3.04
# 6 6 16 -3.04
head(df3)
# X1 X2 recoded
# 1 6 15 -3.040
# 2 9 18 -1.605
# 3 14 8 -0.168
# 4 23 3 4.594
# 5 5 22 -4.287
# 6 23 1 4.594
Upvotes: 7
Reputation: 36104
I like @Henrik's merge solution; it seems clear and easy to use.
I went the way of using factor, although I don't think the process of converting back to numeric is ever very elegant. @hadley's solution using cut is similar.
df = data.frame(x = sample(5:7, 10, replace = TRUE))
# Using factor(): give the levels explicitly so the labels still line up when
# a value is missing from the sample; converting to numeric has to go through character
df$y = as.numeric(as.character(factor(df$x, levels = 5:7, labels = c(-4.287, -3.040, -2.405))))
# Using cut() is similar to factor(), but needs explicit breaks: (4,5], (5,6], (6,7]
df$z = as.numeric(as.character(cut(df$x, breaks = 4:7, labels = c(-4.287, -3.040, -2.405))))
Upvotes: 1
Reputation: 29875
I am not a hundred percent sure I understood your question correctly, but I think what you are asking is this: you have a mapping from an integer to a value, and you want to replace all integers in a data frame (or vector) with the value specified in the mapping.
I would put the mapping in a list:
code = list()
code[[5 ]] = -4.287
code[[6 ]] = -3.040
code[[7 ]] = -2.405
code[[8 ]] = -1.960
code[[9 ]] = -1.605
code[[10]] = -1.296
code[[11]] = -1.011
code[[12]] = -0.735
code[[13]] = -0.456
code[[14]] = -0.168
code[[15]] = 0.134
code[[16]] = 0.454
code[[17]] = 0.796
code[[18]] = 1.166
code[[19]] = 1.574
code[[20]] = 2.035
code[[21]] = 2.582
code[[22]] = 3.299
code[[23]] = 4.594
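And then use apply (or sapply for a vector) to do the replacement. Note that this assumes every cell holds a key that exists in code; NA, 0, or unmapped values need separate handling: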
apply(df, c(1,2), function(x) code[[x]])
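For a single column that may contain NA, 0, or sums that are not in the mapping, a guarded lookup avoids errors. This is a sketch only; the helper name lookup_code is hypothetical, and the list is shortened to three entries to keep the example small.

```r
code <- list()
code[[5]] <- -4.287
code[[6]] <- -3.040
code[[7]] <- -2.405

# Hypothetical helper: return NA instead of failing on keys the list lacks
lookup_code <- function(x) {
  if (is.na(x) || x < 1 || x > length(code) || is.null(code[[x]])) {
    NA_real_
  } else {
    code[[x]]
  }
}

# 7 and 5 are mapped; 0, NA and 3 are not and become NA
recoded <- vapply(c(7, 0, NA, 3, 5), lookup_code, numeric(1))
recoded
```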
Upvotes: 1