armin
armin

Reputation: 77

How to calculate a new column based on other columns using a lookup approach in R?

I am trying to calculate another column in a dataframe based on another columns and a lookup table. I have a simple example that only shows few data (my real dataset contains millions of rows).

I have the following datasets:

  lookup<- data.frame("class"=c(1, 2, 1, 2), "type"=c("A", "B", "B", "A"), 
           "condition1"=c(50, 60, 55, 53), "condition2"=c(80, 85, 86, 83))

  lookup
  class type condition1 condition2
      1    A         50         80
      2    B         60         85
      1    B         55         86
      2    A         53         83

My dataframe is of this shape:

  data<- data.frame("class"=c(1, 2, 2, 1, 2, 1), 
         "type"=c("A","B", "A", "A", "B", "B"), 
         "percentage_condition1"=c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5), 
         "percentage_condition2"=c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))


  data
  class type percentage_condition1 percentage_condition2
    1    A                   0.3                   0.7
    2    B                   0.6                   0.4
    2    A                   0.1                   0.9
    1    A                   0.2                   0.8
    2    B                   0.4                   0.6
    1    B                   0.5                   0.5

I would like to create a new column in my dataframe named data that will use the lookup table such as :

in my data where my class matches my type columns, it can calculate a new column in my dataframe data such as (not real code):

d$new<- lookup$condition1 * data$percentage_condition1 + lookup$condition2 * data$percentage_condition2

I know how to do it with a if else statement but I am trying to do it more efficiently as I am working with a lot of data. I know to do it with one column in the lookup table but I do not succeed with several columns (class and type column).

Thanks for any help and suggestions!

Upvotes: 1

Views: 641

Answers (2)

akrun
akrun

Reputation: 887391

We can use match to get the index of 'type' columns for 'data' and 'type', use that index to get the corresponding rows of 'condition1', 'condition2' columns, multiply with the percentage columns of 'data' and get the rowSums

data$new <- rowSums(lookup[match(paste(data$class, data$type), 
                  paste(lookup$class, lookup$type)), 
               c("condition1", "condition2")] * data[3:4])

data
#  class type percentage_condition1 percentage_condition2  new
#1     1    A                   0.3                   0.7 71.0
#2     2    B                   0.6                   0.4 70.0
#3     2    A                   0.1                   0.9 80.0
#4     1    A                   0.2                   0.8 74.0
#5     2    B                   0.4                   0.6 75.0
#6     1    B                   0.5                   0.5 70.5

NOTE: With match, we can do it much easier


Or using data.table

library(data.table)
setDT(data)[lookup, new := condition1 * percentage_condition1 + 
       condition2 * percentage_condition2, on = .(class, type)]
data
#   class type percentage_condition1 percentage_condition2  new
#1:     1    A                   0.3                   0.7 71.0
#2:     2    B                   0.6                   0.4 70.0
#3:     2    A                   0.1                   0.9 80.0
#4:     1    A                   0.2                   0.8 74.0
#5:     2    B                   0.4                   0.6 75.0
#6:     1    B                   0.5                   0.5 70.5

Or using tidyverse

library(tidyverse)
data %>% 
     left_join(lookup, by = c("class", "type")) %>%
     mutate(new = condition1 * percentage_condition1 + 
       condition2 * percentage_condition2) %>%
     select(names(data), new)
#   class type percentage_condition1 percentage_condition2  new
#1     1    A                   0.3                   0.7 71.0
#2     2    B                   0.6                   0.4 70.0
#3     2    A                   0.1                   0.9 80.0
#4     1    A                   0.2                   0.8 74.0
#5     2    B                   0.4                   0.6 75.0
#6     1    B                   0.5                   0.5 70.5

Or use a SQL based solution with sqldf

library(sqldf)
str1 <- "SELECT data.class, data.type, data.percentage_condition1, 
  data.percentage_condition2, (data.percentage_condition1 * lookup.condition1 + 
   data.percentage_condition2 * lookup.condition2) as new
   FROM data 
   LEFT JOIN lookup on data.class = lookup.class AND 
   data.type = lookup.type"
sqldf(str1)

Or as @G.Grothendieck mentioned in the comments, with alias identifiers, sqldf solution can be made more compact

sqldf("select D.*, L.condition1 * D.[percentage_condition1] + 
       L.condition2 * D.[percentage_condition2] as new 
       from data as D 
       left join lookup as L 
       using(class, type)")

NOTE: All the solutions maintains the original order of the dataset

Upvotes: 2

Ronak Shah
Ronak Shah

Reputation: 389065

One option is to merge data and lookup and then perform the calculation

df1 <- merge(data, lookup) #This merges by class and type columns

df1$new <- with(df1, (condition1 * percentage_condition1) + 
                     (condition2 * percentage_condition2))


df1
#  class type percentage_condition1 percentage_condition2 condition1 condition2  new
#1     1    A                   0.3                   0.7         50         80 71.0
#2     1    A                   0.2                   0.8         50         80 74.0
#3     1    B                   0.5                   0.5         55         86 70.5
#4     2    A                   0.1                   0.9         53         83 80.0
#5     2    B                   0.6                   0.4         60         85 70.0
#6     2    B                   0.4                   0.6         60         85 75.0

Upvotes: 2

Related Questions