lschuetze

Reputation: 1218

Normalize data to value depending on multiple fields and conditions

I am quite new to R. I have a table with the header (Value, Benchmark, Suite, Var), and I want to normalize each Value to the mean of the baseline for its combination of (Benchmark, Var). In other words, each entry (Value, Benchmark, Suite, Var) should be divided by the mean Value of the baseline rows that have the same Benchmark and Var.

The data represents different benchmark measurements, where Var holds the different input sizes. A sample of the data looks like this:

Value   Benchmark  Suite     Var
500     Benchmark2 baseline  1732
889     Benchmark  baseline  1732
500     Benchmark2 baseline  1732
889     Benchmark  baseline  1732
300     Benchmark  Approach1 1732
100     Benchmark2 Approach1 1732

After the transformation, it would look like this:

Value   Benchmark  Suite     Var   RuntimeRatio
500     Benchmark2 baseline  1732  1.00
889     Benchmark  baseline  1732  1.00
500     Benchmark2 baseline  1732  1.00
889     Benchmark  baseline  1732  1.00
300     Benchmark  Approach1 1732  0.34 # 300 compared to mean(889,889) of each (Benchmark,baseline,1732)
100     Benchmark2 Approach1 1732  0.20 # 100 compared to mean(500,500) of each (Benchmark2,baseline,1732)

I currently have something like the following, but it does not calculate the right thing:

norm <- ddply(data, Var ~ Benchmark, transform,
          RuntimeRatio = Value / mean(Value[Suite == "baseline"]))

Upvotes: 0

Views: 73

Answers (1)

Edo

Reputation: 7818

I think the best and cleanest way to do it is to do a bit of data manipulation before computing the ratio.

Your Data:

df <- tibble::tribble(
  
  ~Value, ~Benchmark  ,  ~Suite     , ~Var,
  500   , "Benchmark2", "baseline"  , 1732,
  889   , "Benchmark" , "baseline"  , 1732,
  500   , "Benchmark2", "baseline"  , 1732,
  889   , "Benchmark" , "baseline"  , 1732,
  300   , "Benchmark" , "Approach1" , 1732,
  100   , "Benchmark2", "Approach1" , 1732
  
)

With the package dplyr we can easily and intuitively manipulate data.

library(dplyr)

# separate the baseline from the rest
df_baseline <- df %>% filter(Suite == "baseline")
df_compare  <- df %>% filter(Suite != "baseline")

# calculate the mean of the baseline value for each Benchmark-Var
df_baseline <- df_baseline %>% 
  group_by(Benchmark, Var) %>% 
  summarise(Value_baseline = mean(Value)) %>% 
  ungroup()

# Join the baseline data to the rest of your data with the approaches
df_compare <- df_compare %>%
  left_join(df_baseline, by = c("Benchmark", "Var"))

# Calculate your ratio
df_compare %>%
  mutate(RuntimeRatio = Value / Value_baseline)

# # A tibble: 2 x 6
#   Value Benchmark  Suite       Var Value_baseline RuntimeRatio
#   <dbl> <chr>      <chr>     <dbl>          <dbl>        <dbl>
# 1   300 Benchmark  Approach1  1732            889        0.337
# 2   100 Benchmark2 Approach1  1732            500        0.2  

This gives what I believe you actually need: the ratio for each non-baseline row.

But if you want exactly the output you asked for, with the baseline rows included, join df_baseline to the original df instead:

df %>% 
  left_join(df_baseline, by = c("Benchmark", "Var")) %>% 
  mutate(RuntimeRatio = Value / Value_baseline) %>% 
  select(-Value_baseline)

# # A tibble: 6 x 5
#   Value Benchmark  Suite       Var RuntimeRatio
#   <dbl> <chr>      <chr>     <dbl>        <dbl>
# 1   500 Benchmark2 baseline   1732        1    
# 2   889 Benchmark  baseline   1732        1    
# 3   500 Benchmark2 baseline   1732        1    
# 4   889 Benchmark  baseline   1732        1    
# 5   300 Benchmark  Approach1  1732        0.337
# 6   100 Benchmark2 Approach1  1732        0.2  
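
A more compact variant is also possible. The sketch below is an alternative under stated assumptions, not a replacement for the join-based steps: it computes the baseline mean inside each (Benchmark, Var) group with a grouped mutate, so no separate baseline table or join is needed. It rebuilds the sample data as a plain data.frame, and the name `result` is only illustrative.

```r
library(dplyr)

# Same sample values as in the question, as a plain data.frame
df <- data.frame(
  Value     = c(500, 889, 500, 889, 300, 100),
  Benchmark = c("Benchmark2", "Benchmark", "Benchmark2",
                "Benchmark", "Benchmark", "Benchmark2"),
  Suite     = c("baseline", "baseline", "baseline",
                "baseline", "Approach1", "Approach1"),
  Var       = c(1732, 1732, 1732, 1732, 1732, 1732)
)

# Within each (Benchmark, Var) group, divide every Value by the
# mean of the Values whose Suite is "baseline" in that same group
result <- df %>%
  group_by(Benchmark, Var) %>%
  mutate(RuntimeRatio = Value / mean(Value[Suite == "baseline"])) %>%
  ungroup()

print(result)
```

One thing to keep in mind: a (Benchmark, Var) group without any baseline rows gets NaN here, because the mean of an empty vector is NaN, whereas the left_join version gives NA for such rows.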

Upvotes: 2
