Reputation: 673
I am having some trouble replacing values in a dataframe. I would like to replace values based on a separate table. Below is an example of what I am trying to do.
I have a table where every row is a customer and every column is an animal they purchased. Lets call this dataframe table
.
> table
# P1 P2 P3
# 1 cat lizard parrot
# 2 lizard parrot cat
# 3 parrot cat lizard
I also have a table that I will reference called lookUp
.
> lookUp
# pet class
# 1 cat mammal
# 2 lizard reptile
# 3 parrot bird
What I want to do is create a new table called new
with a function replaces all values in table
with the class
column in lookUp
. I tried this myself using an lapply
function, but I got the following warnings.
new <- as.data.frame(lapply(table, function(x) {
gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)
Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
argument 'replacement' has length > 1 and only the first element will be used
Any ideas on how to make this work?
Upvotes: 67
Views: 78778
Reputation: 73802
Out of burning curiosity, I just ran a benchmark with some of the approaches that I want to share with you. I couldn't quite believe some of the statements about performance in the answers and am trying to clarify this herewith. In order not to be misled by different rows/columns ratios, I consider three scenarios:
ncol == nrow
ncol << nrow
ncol >> nrow.
It might be beneficial to coerce as.matrix
beforehand, so I included this as an additional solution (unlist_mat).
microbenchmark::microbenchmark(
lapply=Dat1[col_set] <- lapply(Dat1[col_set], function(x) Look$class[match(x, Look$pet)]),
unlist=Dat2[col_set] <- Look$class[match(unlist(Dat2[col_set]), Look$pet)],
unlist_mat=Mat[, col_set] <- Look$class[match(as.vector(Mat[, col_set]), Look$pet)], ## added
ifelse=Dat3[col_set] <- ifelse(Dat3[col_set] == "cat", "mammal",
ifelse(Dat3[col_set] == "lizard", "reptile",
ifelse(Dat3[col_set] == "parrot", "bird", NA))),
look_vec=Dat4[] <- lapply(Dat4, function(i) look[i]),
times=3L
)
## 1e3 x 1e3
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# lapply 40.42905 63.47053 78.03831 86.51201 96.84294 107.17387 3 a
# unlist 513.25197 540.55981 656.25420 567.86766 727.75531 887.64297 3 b
# unlist_mat 45.91743 56.51087 68.50595 67.10432 79.80021 92.49611 3 a
# ifelse 117.83513 153.23771 366.16708 188.64030 490.33306 792.02581 3 ab
# look_vec 58.54449 88.40293 112.91165 118.26137 140.09522 161.92908 3 a
## 1e4 x 1e4
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.427077 3.558234 3.992481 4.689390 4.775183 4.860977 3 a
# unlist 73.125989 79.203107 94.027433 85.280225 104.478155 123.676084 3 b
# unlist_mat 4.940254 5.011684 5.576553 5.083114 5.894703 6.706291 3 a
# ifelse 9.714553 14.444899 36.176777 19.175244 49.407889 79.640535 3 a
# look_vec 8.460969 8.558600 8.784463 8.656230 8.946209 9.236188 3 a
## 1e5 x 1e3
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 2.314427 2.403001 3.270708 2.491575 3.748848 5.006120 3 a
# unlist 64.098825 66.850221 81.402676 69.601616 90.054601 110.507586 3 b
# unlist_mat 5.018869 5.060865 5.638499 5.102861 5.948314 6.793767 3 a
# ifelse 6.244744 16.488266 39.208119 26.731788 55.689807 84.647825 3 ab
# look_vec 4.512672 6.434651 7.496267 8.356630 8.988064 9.619498 3 a
## 1e3 x 1e5
# Unit: seconds
# expr min lq mean median uq max neval cld
# lapply 52.833019 55.373432 71.308981 57.913845 80.546963 103.180080 3 ab
# unlist 164.901805 168.710285 186.454796 172.518765 197.231292 221.943819 3 c
# unlist_mat 3.872551 4.422904 4.695393 4.973257 5.106814 5.240372 3 a
# ifelse 72.592437 76.473418 103.930063 80.354399 119.598876 158.843354 3 b
# look_vec 56.444824 58.904604 62.677267 61.364383 65.793488 70.222593 3 ab
Note: Performed on an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz using R --vanilla
.
all(sapply(list(Dat2, as.data.frame(Mat), Dat3, Dat4), identical, Dat1)) ## *
# [1] TRUE
## *manipulate the data first outside the benchmark, of course!
Using lapply
with a lookup matrix appears to be a good choice if the number of columns is rather low/lower than the number of rows. If we have many columns, especially compared to rows, we might benefit from coercing the respective columns of the data frame into a matrix first, which should only take a blink of an eye.
set.seed(42)
n <- 1e4; m <- 1e4
Dat <- data.frame(matrix(sample(c("cat", "lizard", "parrot"), n*m, replace=TRUE), n, m))
Look <- structure(list(pet = c("cat", "lizard", "parrot"), class = c("mammal", "reptile", "bird")),
class = "data.frame", row.names = c("1", "2", "3"))
look <- setNames(as.character(Look$class), Look$pet)
col_set <- names(Dat)
system.time(
Mat <- as.matrix(Dat)
)
# user system elapsed
# 0.844 0.318 1.161
Dat1 <- Dat2 <- Dat3 <- Dat4 <- Dat
Upvotes: 0
Reputation: 15642
I did it using the factor
built-in.
table$P1 <- factor(table$P1, levels=lookUp$pet, labels=lookUp$class)
table$P2 <- factor(table$P2, levels=lookUp$pet, labels=lookUp$class)
table$P3 <- factor(table$P3, levels=lookUp$pet, labels=lookUp$class)
Upvotes: 4
Reputation: 1210
I tried other approaches and they took a really long time with my very large dataset. I used the following instead:
# make table "new" using ifelse. See data below to avoid re-typing it
new <- ifelse(table1 =="cat", "mammal",
ifelse(table1 == "lizard", "reptile",
ifelse(table1 =="parrot", "bird", NA)))
This method requires you to write more text for your code, but the vectorization of ifelse
makes it run faster. You have to decide, based on your data, if you want to spend more time writing code or waiting for your computer to run. If you want to make sure it worked (you didn't have any typos in your iflese
commands), you can use apply(new, 2, function(x) mean(is.na(x)))
.
data
# create the data table
table1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
Upvotes: 2
Reputation: 34763
Anytime you have two separate data.frame
s and are trying to bring info from one to the other, the answer is to merge.
Everyone has their own favorite merge method in R. Mine is data.table
.
Also, since you want to do this to many columns, it'll be faster to melt
and dcast
-- rather than loop over columns, apply it once to a reshaped table, then reshape again.
library(data.table)
#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE)
setDT(lookUp)
#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')
# merging
][lookup, new_value := i.class, on = c(value = 'pet')
#reform back to original shape
][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
# rn P1 P2 P3
# 1: 1 mammal reptile bird
# 2: 2 reptile bird mammal
# 3: 3 bird mammal reptile
In case you find the dcast
/melt
bit a bit intimidating, here's an approach that just loops over columns; dcast
/melt
is simply sidestepping the loop for this problem.
setDT(table) #don't need row names this time
setDT(lookUp)
sapply(names(table), #(or to whichever are the relevant columns)
function(cc) table[lookUp, (cc) := #merge, replace
#need to pass a _named_ vector to 'on', so use setNames
i.class, on = setNames("pet", cc)])
Upvotes: 22
Reputation: 83
The answer above showing how to do this in dplyr doesn't answer the question, the table is filled with NAs. This worked, I would appreciate any comments showing a better way:
# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>%
# put in long format, naming column filled with P1, P2, P3 "petCount"
gather(key="petCount", value="pet", -customer) %>%
# add a new column based on the pet's class in data frame "lookup"
left_join(lookup, by="pet") %>%
# since you wanted to replace the values in "table" with their
# "class", remove the pet column
select(-pet) %>%
# put data back into wide format
spread(key="petCount", value="class")
Note that it would likely be useful to keep the long table that contains the customer, the pet, the pet's species(?) and their class. This example simply adds an intermediary save to a variable:
table$customer = seq(nrow(table))
petClasses <- table %>%
gather(key="petCount", value="pet", -customer) %>%
left_join(lookup, by="pet")
custPetClasses <- petClasses %>%
select(-pet) %>%
spread(key="petCount", value="class")
Upvotes: 3
Reputation: 56259
Make a named vector, and loop through every column and match, see:
# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1
# cat lizard parrot
# "mammal" "reptile" "bird"
# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[i]))
# reset rownames
rownames(res) <- NULL
# res
# P1 P2 P3
# 1 mammal reptile bird
# 2 reptile bird mammal
# 3 bird mammal reptile
df1 <- read.table(text = "
P1 P2 P3
1 cat lizard parrot
2 lizard parrot cat
3 parrot cat lizard", header = TRUE)
lookUp <- read.table(text = "
pet class
1 cat mammal
2 lizard reptile
3 parrot bird", header = TRUE)
Upvotes: 13
Reputation: 70336
You posted an approach in your question which was not bad. Here's a smiliar approach:
new <- df # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])
An alternative approach which will be faster is:
new <- df
new[] <- look$class[match(unlist(df), look$pet)]
Note that I use empty brackets ([]
) in both cases to keep the structure of new
as it was (a data.frame).
(I'm using df
instead of table
and look
instead of lookup
in my answer)
Upvotes: 52
Reputation: 18487
Another options is a combination of tidyr
and dplyr
library(dplyr)
library(tidyr)
table %>%
gather(key = "pet") %>%
left_join(lookup, by = "pet") %>%
spread(key = pet, value = class)
Upvotes: 28