Reputation: 77
I am trying to calculate another column in a dataframe based on another columns and a lookup table. I have a simple example that only shows few data (my real dataset contains millions of rows).
I have the following datasets:
lookup<- data.frame("class"=c(1, 2, 1, 2), "type"=c("A", "B", "B", "A"),
"condition1"=c(50, 60, 55, 53), "condition2"=c(80, 85, 86, 83))
lookup
class type condition1 condition2
1 A 50 80
2 B 60 85
1 B 55 86
2 A 53 83
My dataframe is of this shape:
data<- data.frame("class"=c(1, 2, 2, 1, 2, 1),
"type"=c("A","B", "A", "A", "B", "B"),
"percentage_condition1"=c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
"percentage_condition2"=c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))
data
class type percentage_condition1 percentage_condition2
1 A 0.3 0.7
2 B 0.6 0.4
2 A 0.1 0.9
1 A 0.2 0.8
2 B 0.4 0.6
1 B 0.5 0.5
I would like to create a new column in my dataframe named data that will use the lookup table such as :
in my data where my class matches my type columns, it can calculate a new column in my dataframe data such as (not real code):
d$new<- lookup$condition1 * data$percentage_condition1 + lookup$condition2 * data$percentage_condition2
I know how to do it with a if else statement but I am trying to do it more efficiently as I am working with a lot of data. I know to do it with one column in the lookup table but I do not succeed with several columns (class and type column).
Thanks for any help and suggestions!
Upvotes: 1
Views: 641
Reputation: 887391
We can use match
to get the index of 'type' columns for 'data' and 'type', use that index to get the corresponding rows of 'condition1', 'condition2' columns, multiply with the percentage columns of 'data' and get the rowSums
data$new <- rowSums(lookup[match(paste(data$class, data$type),
paste(lookup$class, lookup$type)),
c("condition1", "condition2")] * data[3:4])
data
# class type percentage_condition1 percentage_condition2 new
#1 1 A 0.3 0.7 71.0
#2 2 B 0.6 0.4 70.0
#3 2 A 0.1 0.9 80.0
#4 1 A 0.2 0.8 74.0
#5 2 B 0.4 0.6 75.0
#6 1 B 0.5 0.5 70.5
NOTE: With match
, we can do it much easier
Or using data.table
library(data.table)
setDT(data)[lookup, new := condition1 * percentage_condition1 +
condition2 * percentage_condition2, on = .(class, type)]
data
# class type percentage_condition1 percentage_condition2 new
#1: 1 A 0.3 0.7 71.0
#2: 2 B 0.6 0.4 70.0
#3: 2 A 0.1 0.9 80.0
#4: 1 A 0.2 0.8 74.0
#5: 2 B 0.4 0.6 75.0
#6: 1 B 0.5 0.5 70.5
Or using tidyverse
library(tidyverse)
data %>%
left_join(lookup, by = c("class", "type")) %>%
mutate(new = condition1 * percentage_condition1 +
condition2 * percentage_condition2) %>%
select(names(data), new)
# class type percentage_condition1 percentage_condition2 new
#1 1 A 0.3 0.7 71.0
#2 2 B 0.6 0.4 70.0
#3 2 A 0.1 0.9 80.0
#4 1 A 0.2 0.8 74.0
#5 2 B 0.4 0.6 75.0
#6 1 B 0.5 0.5 70.5
Or use a SQL based solution with sqldf
library(sqldf)
str1 <- "SELECT data.class, data.type, data.percentage_condition1,
data.percentage_condition2, (data.percentage_condition1 * lookup.condition1 +
data.percentage_condition2 * lookup.condition2) as new
FROM data
LEFT JOIN lookup on data.class = lookup.class AND
data.type = lookup.type"
sqldf(str1)
Or as @G.Grothendieck mentioned in the comments, with alias identifiers, sqldf
solution can be made more compact
sqldf("select D.*, L.condition1 * D.[percentage_condition1] +
L.condition2 * D.[percentage_condition2] as new
from data as D
left join lookup as L
using(class, type)")
NOTE: All the solutions maintains the original order of the dataset
Upvotes: 2
Reputation: 389065
One option is to merge
data
and lookup
and then perform the calculation
df1 <- merge(data, lookup) #This merges by class and type columns
df1$new <- with(df1, (condition1 * percentage_condition1) +
(condition2 * percentage_condition2))
df1
# class type percentage_condition1 percentage_condition2 condition1 condition2 new
#1 1 A 0.3 0.7 50 80 71.0
#2 1 A 0.2 0.8 50 80 74.0
#3 1 B 0.5 0.5 55 86 70.5
#4 2 A 0.1 0.9 53 83 80.0
#5 2 B 0.6 0.4 60 85 70.0
#6 2 B 0.4 0.6 60 85 75.0
Upvotes: 2