KT_1

Reputation: 8494

Recode from a long list of variables

Using the example dataframe:

df <- structure(list(KY27SCH1 = c(4, 4, 4, 4, NA, 5, 2, 4, 4, NA, 4, 5, 3, 5, 5),
                     KY27SCH2 = c(5, 4, 4, 4, NA, 4, 1, 4, 4, NA, 4, 5, 4, 5, 5),
                     KY27SCH3 = c(4, 4, 5, 4, NA, 4, 4, 4, 5, NA, 5, 5, 3, 5, 5),
                     KY27SCH4 = c(3, 5, 5, 4, NA, 5, 4, 5, 5, NA, 5, 5, 4, 5, 5)),
                .Names = c("KY27SCH1", "KY27SCH2", "KY27SCH3", "KY27SCH4"),
                row.names = 197:211, class = "data.frame")

I use a line of code to sum the four columns row-wise, then bind this new column to the original dataframe:

KC27sc_R <- rowSums(df[, c("KY27SCH1", "KY27SCH2", "KY27SCH3", "KY27SCH4")], na.rm = TRUE)
df <- cbind(df, KC27sc_R) # Adds columns to survey dataframe

I then want to recode variable KC27sc_R using the table of results detailed below:

5 = -4.287
6 = -3.040
7 = -2.405
8 = -1.960
9 = -1.605
10 = -1.296
11 = -1.011
12 = -0.735
13 = -0.456
14 = -0.168
15 = 0.134
16 = 0.454
17 = 0.796
18 = 1.166
19 = 1.574
20 = 2.035
21 = 2.582
22 = 3.299 
23 = 4.594

i.e. 5 in column KC27sc_R would become -4.287.

Is there a way of recoding a column from a list of numbers without having to go through each number in turn? I normally use the recode function but I am unsure how to do this with a large list.

Any help would be appreciated.

Upvotes: 3

Views: 4852

Answers (7)

Skif

Reputation: 288

For starters, let's suppose that your table of results is stored in a matrix yo:

yo <- matrix(0, nrow = 19, ncol = 2)
yo[, 1] <- c(5:23)
yo[, 2] <- c( -4.287, -3.040, -2.405, -1.960, -1.605, -1.296, -1.011, -0.735, -0.456, -0.168, 0.134, 0.454, 0.796, 1.166, 1.574, 2.035, 2.582, 3.299, 4.594)

i.e., the first column of yo holds the values you want to change, and the second holds their replacements. Put simply, treat yo as a function: the first column is the function's input, the second its output.

The first thing you need to figure out is which values of KC27sc_R are actually present in yo[, 1] (i.e., in which rows you can substitute old values with new ones). This is done like so:

ind <- which( df$KC27sc_R %in% yo[,1] )

ind gives you the rows of all the values of KC27sc_R you can change. The next step would be to get all these values:

a <- df[ind,]$KC27sc_R

Next, link the values in a with those in yo[, 1]: literally, find the row of yo where each value in a appears. The match function is helpful here:

b <- match( a, yo[,1] )

Like ind, b is an index: for each value in a, it tells you which row of yo[, 2] holds the replacement for that value. The final step is to replace the values in your df:

df[ind, "KC27sc_R"] <- yo[b, 2]

That will do the trick.
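The ind/a/b steps above can also be collapsed into a single assignment; a minimal self-contained sketch of the same idea (the small df here is just illustrative, with 0 standing in for an all-NA row):

```r
# yo maps 5:23 to the new values, as in the answer above
yo <- cbind(5:23,
            c(-4.287, -3.040, -2.405, -1.960, -1.605, -1.296, -1.011,
              -0.735, -0.456, -0.168, 0.134, 0.454, 0.796, 1.166,
              1.574, 2.035, 2.582, 3.299, 4.594))
df <- data.frame(KC27sc_R = c(16, 0, 5, 23))

# match() on the subset collapses the a and b steps into one
ind <- which(df$KC27sc_R %in% yo[, 1])
df[ind, "KC27sc_R"] <- yo[match(df$KC27sc_R[ind], yo[, 1]), 2]
df$KC27sc_R
#  0.454  0.000 -4.287  4.594
```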

Upvotes: 1

Henrik

Reputation: 67818

If you keep your recoding values in a second data frame, you may try merge:

# original data
df <- data.frame(x = sample(5:7, 10, replace = TRUE))

# recoding data
df2 <- data.frame(x = 5:7, new_x = c(-4.287, -3.040, -2.405))

merge(df, df2)

Edit following @hadley's comment

The solution above is testing floating point numbers for equality, which is not a reliable method, see R FAQ 7.31 Why doesn't R think these numbers are equal?. One way to handle this is to set the columns used for merging to class "integer". (I note that this potential issue is not considered in the last example in ?merge)

# original data
df <- data.frame(x = as.integer(sample(5:7, 10, replace = TRUE)))

# recoding data
df2 <- data.frame(x = as.integer(5:7), new_x = c(-4.287, -3.040, -2.405))

merge(df, df2)

Edit following @Ananda Mahto's comment - funky sorting with merge and handling of NA

# original data with NA
df <- data.frame(x = as.integer(c(7, NA, 5, 6, NA, 5)))

# recoding data as above, without NA
merge(df, df2, sort = FALSE)
# 'unspecified' order and no match with NA

# can at least handle NA by including NA also in recoding data
df2 <- data.frame(x = as.integer(c(5:7, NA)), new_x = c(-4.287, -3.040, -2.405, NA))
merge(df, df2, sort = FALSE)

A possible merge-like solution: join in the plyr package.
"Unlike merge, join preserves the order of x no matter what join type is used", and no need for NA in recoding data.

library(plyr)
df <- data.frame(x = as.integer(c(7, NA, 5, 6, NA, 5)))
df2 <- data.frame(x = as.integer(c(5:7)), new_x = c(-4.287, -3.040, -2.405))
join(df, df2)
# looks OK

From ?join: "Join is often faster than merge". Whether this is the case here, and whether it is faster than match, I leave for someone else to show.

Upvotes: 3

beroe

Reputation: 12326

An answer that requires a minimum of typing and is portable:

# Your original values
origval = seq(5,23)
newval = c(-4.287, -3.04, -2.405, -1.96, -1.605, -1.296, -1.011, -0.735, -0.456, -0.168, 0.134, 0.454, 0.796, 1.166, 1.574, 2.035, 2.582, 3.299, 4.594)

# generate a relationship
sp = smooth.spline(origval,newval)

# look up a value based on your original sequence
pval = predict(sp, origval)

Now pval$y will contain the predicted (converted) points.

You can put any other set of values into the predict function in place of origval, in any order, even ones that weren't in the data series (5.5, etc)
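One caveat worth noting: smooth.spline fits a smoothing spline, so the converted values may differ slightly from the table. If the table must be reproduced exactly while still allowing in-between inputs, base R's splinefun (an interpolating spline) is an alternative; a sketch:

```r
origval <- 5:23
newval <- c(-4.287, -3.04, -2.405, -1.96, -1.605, -1.296, -1.011, -0.735,
            -0.456, -0.168, 0.134, 0.454, 0.796, 1.166, 1.574, 2.035,
            2.582, 3.299, 4.594)

# smooth.spline fits a smoothed curve through the table
sp <- smooth.spline(origval, newval)
predict(sp, 5.5)$y      # smoothed estimate between two table points

# splinefun interpolates through every table point exactly
f <- splinefun(origval, newval)
f(5)                    # reproduces the table value -4.287 exactly
f(5.5)                  # interpolated value near its two neighbours
```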

Applied to your data set, you can create a placeholder for your variable and then "predict" its values:

df$KY_Rnew = df$KC27sc_R
df$KY_Rnew[!is.na(df$KY_Rnew)] = predict(sp,df$KY_Rnew[!is.na(df$KY_Rnew)])$y

Upvotes: 1

lauratboyer

Reputation: 334

Assuming your mapping values are all integers, you could create a vector containing the coded values at the position of the mapping values:

# using mydf defined by Ananda Mahto:
mydf <- structure(list(V1 = c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
 16, 17, 18, 19, 20, 21, 22, 23), V2 = c(-4.287, -3.04, -2.405, 
 -1.96, -1.605, -1.296, -1.011, -0.735, -0.456, -0.168, 0.134, 
 0.454, 0.796, 1.166, 1.574, 2.035, 2.582, 3.299, 4.594)), .Names = c("V1", 
 "V2"), class = "data.frame", row.names = c(NA, -19L))

# create vector with index positions corresponding to objective values:
vmap <- rep(NA, length=max(mydf$V1)) 
vmap[mydf$V1] <- mydf$V2

vmap
# [1]     NA     NA     NA     NA -4.287 -3.040 -2.405 -1.960 -1.605 -1.296
# [11] -1.011 -0.735 -0.456 -0.168  0.134  0.454  0.796  1.166  1.574  2.035
# [21]  2.582  3.299  4.594

# Assign NA to zero values in KC27sc_R (as you cannot have a zero position in an R vector)
# (this could also be another value defined in mydf if you want zero to map to something)
KC27sc_R[KC27sc_R==0] <- NA

# Then, select the values in vmap using the indices defined in KC27sc_R:
Krecode <- vmap[KC27sc_R]
data.frame(KC27sc_R, Krecode)


# KC27sc_R Krecode
# 197       16   0.454
# 198       17   0.796
# 199       18   1.166
# 200       16   0.454
# 201       NA      NA
# 202       18   1.166
# ... etc

This should be fairly fast with a long list as all operations are vectorized.
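The positional vmap above needs integer codes; a named vector handles non-contiguous or non-integer codes too, at the cost of a character conversion. A sketch with a hypothetical, abbreviated table:

```r
# Named-vector variant of the lookup: names are the old codes,
# values are the replacements
mydf <- data.frame(V1 = 5:7, V2 = c(-4.287, -3.040, -2.405))  # abbreviated
vmap2 <- setNames(mydf$V2, mydf$V1)

KC27sc_R <- c(6, NA, 5)
# NA indices simply return NA, so no pre-cleaning is needed
unname(vmap2[as.character(KC27sc_R)])
# -3.040     NA -4.287
```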

Upvotes: 1

A5C1D2H2I1M1N2O1R2T1

Reputation: 193677

Assume we have put your lookup table as a data.frame that looks like this:

mydf <- structure(list(V1 = c(5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
    16, 17, 18, 19, 20, 21, 22, 23), V2 = c(-4.287, -3.04, -2.405, 
    -1.96, -1.605, -1.296, -1.011, -0.735, -0.456, -0.168, 0.134, 
    0.454, 0.796, 1.166, 1.574, 2.035, 2.582, 3.299, 4.594)), .Names = c("V1", 
    "V2"), class = "data.frame", row.names = c(NA, -19L))
mydf
#    V1     V2
# 1   5 -4.287
# 2   6 -3.040
# 3   7 -2.405
# 4   8 -1.960
# 5   9 -1.605
# 6  10 -1.296
# 7  11 -1.011
# 8  12 -0.735
# 9  13 -0.456
# 10 14 -0.168
# 11 15  0.134
# 12 16  0.454
# 13 17  0.796
# 14 18  1.166
# 15 19  1.574
# 16 20  2.035
# 17 21  2.582
# 18 22  3.299
# 19 23  4.594

You should be able to just use something like the following to get what you are looking for:

df$RECODED <- mydf$V2[match(as.character(df$KC27sc_R), as.character(mydf$V1))]
df
#     KY27SCH1 KY27SCH2 KY27SCH3 KY27SCH4 KC27sc_R RECODED
# 197        4        5        4        3       16   0.454
# 198        4        4        4        5       17   0.796
# 199        4        4        5        5       18   1.166
# 200        4        4        4        4       16   0.454
# 201       NA       NA       NA       NA        0      NA
# 202        5        4        4        5       18   1.166
# 203        2        1        4        4       11  -1.011
# 204        4        4        4        5       17   0.796
# 205        4        4        5        5       18   1.166
# 206       NA       NA       NA       NA        0      NA
# 207        4        4        5        5       18   1.166
# 208        5        5        5        5       20   2.035
# 209        3        4        3        4       14  -0.168
# 210        5        5        5        5       20   2.035
# 211        5        5        5        5       20   2.035

The as.character part in there is to help mitigate potential floating-point comparison problems.


This is very similar conceptually to the merge solution that has been provided, but is likely to be much faster.

Benchmarking on an artificial dataset:

set.seed(1)
df <- data.frame(matrix(sample(0:25, 100000, replace = TRUE), ncol = 2))

library(microbenchmark)
microbenchmark(
  A = {
    df2 <- merge(df, mydf, by.x="X1", by.y="V1", sort = FALSE)
  },
  B = {
    df3 <- cbind(df, recoded = mydf$V2[match(as.character(df$X1), 
                                             as.character(mydf$V1))])
  }
)
# Unit: milliseconds
#  expr       min        lq    median       uq       max neval
#     A 141.32530 149.61354 154.99230 162.7845 239.26242   100
#     B  24.93267  25.32541  25.73723  26.0792  96.44209   100

The basic matching approach is well over 5x as fast as merge. Additionally, merge tends to do funky things with your row ordering. Compare the first few rows of the original dataset (df), with those of the merged one (df2), and my solution (df3). As you can see, merge has totally rearranged the data.frame, even though we had specified "sort = FALSE".

head(df)
#   X1 X2
# 1  6 15
# 2  9 18
# 3 14  8
# 4 23  3
# 5  5 22
# 6 23  1
head(df2)
#   X1 X2    V2
# 1  6 15 -3.04
# 2  6 23 -3.04
# 3  6  3 -3.04
# 4  6  0 -3.04
# 5  6 20 -3.04
# 6  6 16 -3.04
head(df3)
#   X1 X2 recoded
# 1  6 15  -3.040
# 2  9 18  -1.605
# 3 14  8  -0.168
# 4 23  3   4.594
# 5  5 22  -4.287
# 6 23  1   4.594

Upvotes: 7

aosmith

Reputation: 36104

I like @Henrik's merge solution, it seems clear and easy to use.

I went the way of using factor, although I don't think the process of converting back to numeric is ever very elegant. @hadley's solution using cut is similar.

df = data.frame(x = sample(5:7, 10, replace = TRUE))
# Using factor(), to convert to numeric have to go through a character
df$y = as.numeric(as.character(factor(df$x, labels = c(-4.287, -3.040, -2.405))))

# Using cut() is similar to factor, need to use the breaks argument
df$z = as.numeric(as.character(cut(df$x, breaks = 3, labels = c(-4.287, -3.040, -2.405))))
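One caveat with factor(): labels are matched to the levels actually present in the data, so if one of 5:7 happens not to occur in x, the three labels no longer line up and factor() errors on the length mismatch. Passing levels explicitly guards against this; a sketch with hypothetical data:

```r
df <- data.frame(x = c(5, 7, 7, 5))   # 6 happens not to occur
# With levels fixed explicitly, the label mapping stays stable even
# when some codes are absent from the data
df$y <- as.numeric(as.character(
  factor(df$x, levels = 5:7, labels = c(-4.287, -3.040, -2.405))))
df$y
# -4.287 -2.405 -2.405 -4.287
```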

Upvotes: 1

f3lix

Reputation: 29875

I am not a hundred per cent sure I got your question right. But I think what you are asking is: you have a mapping from an integer to a value, and you want to replace all integers in a data frame (or vector) by the value specified in the mapping.

I would put the mapping in a list:

code = list()
code[[5 ]] = -4.287
code[[6 ]] = -3.040
code[[7 ]] = -2.405
code[[8 ]] = -1.960
code[[9 ]] = -1.605
code[[10]] = -1.296
code[[11]] = -1.011
code[[12]] = -0.735
code[[13]] = -0.456
code[[14]] = -0.168
code[[15]] = 0.134
code[[16]] = 0.454
code[[17]] = 0.796
code[[18]] = 1.166
code[[19]] = 1.574
code[[20]] = 2.035
code[[21]] = 2.582
code[[22]] = 3.299 
code[[23]] = 4.594

And then use apply (or sapply for a vector) to do the replacement:

apply(df, c(1,2), function(x) code[[x]])
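One caveat: with an integer-indexed list, code[[x]] errors when x is NA, and rowSums(..., na.rm = TRUE) produces 0 for all-NA rows, which is also out of range. A sketch of a guarded lookup using a named list instead (abbreviated to three codes; names and helper are hypothetical):

```r
# For lists, a non-matching character name returns NULL rather than
# erroring, which makes out-of-range codes easy to handle
code <- setNames(as.list(c(-4.287, -3.040, -2.405)), 5:7)

safe_lookup <- function(x) {
  if (is.na(x)) return(NA_real_)       # code[[NA]] would error
  val <- code[[as.character(x)]]
  if (is.null(val)) NA_real_ else val  # e.g. x = 0 from all-NA rows
}

sapply(c(5, NA, 7, 0), safe_lookup)
# -4.287     NA -2.405     NA
```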

Upvotes: 1
