trine
trine

Reputation: 5

Calculate value based on multiple other values

I would like to make a some new variables in R based on multiple (>100) other variables.

My dataset looks like this

  sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br
1      1          1          1          2          2          1         6         0         1         6         1
2      2          2                                3          2         5                             1         3
3      3          0          0       <NA>          4          1         0         0      <NA>         2         2
4      4         NA                     1          2          2        NA                   1         1         4
5      5         NA          4          2          3          5        NA         4         3         4         3

The variables diag_x_ais can take integers from 0-6, and diag_x_br can take integers between 1-6. I would like to make 6 new variables corresponding to the 6 possible diag_x_br values, i.e. the new variables would be called br_1, br_2 ... br_6. These new variables shall then be filled with the maximum value of the corresponding diag_x_ais variables, i.e. if diag_1_br, diag_2_br, and diag_4_br are all 3, then br_3 should take the maximum value of diag_1_ais, diag_2_ais, and diag_4_ais.

Please also see the example dataset below:

  sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6
1      1          1          1          2          2          1         6         0         1         6         1    2   NA   NA   NA   NA    2
2      2          2          1          4          3          5         5         2         2         1         3    3    4    5   NA    2   NA
3      3          0          0         NA          4          1         0         0        NA         2         2   NA    4   NA   NA   NA   NA
4      4         NA                     1          2          2        NA                   1         1         4    2   NA   NA    2   NA   NA
5      5         NA          4          2          3          5        NA         4         3         4         3   NA   NA    5    4   NA   NA

Hereafter, I would like a final variable which calculates the sum of the up to three largest br_x variables, example displayed below:

  sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 sum3
1      1          1          1          2          2          1         6         0         1         6         1    2   NA   NA   NA   NA    2    4
2      2          2          1          4          3          5         5         2         2         1         3    3    4    5   NA    2   NA   12
3      3          0          0         NA          4          1         0         0        NA         2         2   NA    4   NA   NA   NA   NA    4
4      4         NA                     1          2          2        NA                   1         1         4    2   NA   NA    2   NA   NA    4
5      5         NA          4          2          3          5        NA         4         3         4         3   NA   NA    5    4   NA   NA    9

My actual dataset has 60 diag_x_ais variables and 60 diag_x_br variables and 4000 rows.

I hope that someone can help me do this in R. Thank you!

Upvotes: 0

Views: 171

Answers (3)

Martin Gal
Martin Gal

Reputation: 16978

You could use some heavy data.transforming most likely not very efficient on large datasets. There are some empty values, NA and 0 in your dataset. I didn't handle them (and replaced the empty values by NA to make importing easier).

library(tidyr)
library(dplyr)

data %>% 
  pivot_longer(-sub_id, 
               names_to = c("name", "cat"), 
               names_pattern = ".*_(\\d+)_(.*)") %>% 
  pivot_wider(names_from = "cat") %>% 
  group_by(sub_id, br) %>% 
  summarise(value = max(ais), .groups = "drop") %>% 
  filter(br %in% 1:6) %>%
  group_by(sub_id) %>% 
  mutate(sum = sum(tail(sort(value), 3))) %>% 
  pivot_wider(names_from = br,
              names_glue = "br_{br}") %>% 
  select(sub_id, paste0("br_", 1:6), sum)

This returns

# A tibble: 5 x 8
  sub_id  br_1  br_2  br_3  br_4  br_5  br_6   sum
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1      1     2    NA    NA    NA    NA     2     4
2      2     3     4     5    NA     2    NA    12
3      3    NA     4    NA    NA    NA    NA     4
4      4     2    NA    NA     2    NA    NA     4
5      5    NA    NA     5     4    NA    NA     9

Piping an addtional right_join(data, by = "sub_id") gives you your example output (minus the order of your columns). I took an idea from this answer.

Data

data <- structure(list(sub_id = c(1, 2, 3, 4, 5), diag_1_ais = c(1, 2, 
0, NA, NA), diag_2_ais = c(1, 1, 0, NA, 4), diag_3_ais = c(2, 
4, NA, 1, 2), diag_4_ais = c(2, 3, 4, 2, 3), diag_5_ais = c(1, 
5, 1, 2, 5), diag_1_br = c(6, 5, 0, NA, NA), diag_2_br = c(0, 
2, 0, NA, 4), diag_3_br = c(1, 2, NA, 1, 3), diag_4_br = c(6, 
1, 2, 1, 4), diag_5_br = c(1, 3, 2, 4, 3)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"))

Upvotes: 2

OneTrickDragon
OneTrickDragon

Reputation: 87

For the first part:

data <- data.frame(sub_id = c(1,2,3,4,5),
                 diag_1_ais  = c(1,2,0,NA,NA),
                 diag_2_ais  = c(1,1,0,NA,4),
                 diag_3_ais  = c(2,4,NA,1,2),
                 diag_4_ais  = c(2,3,4,2,3),
                 diag_5_ais  = c(1,5,1,2,5),
                 diag_1_br = c(6,5,0,NA,NA),
                 diag_2_br = c(0,2,0,NA,4),
                 diag_3_br = c(1,2,NA,1,3),
                 diag_4_br = c(6,1,2,1,4),
                 diag_5_br = c(1,3,2,4,3))

calc_br <- function(data, value, firstBr, lastBr) {
    br <- c()
    for (i in 1:nrow(data)){
        if (length(which(data[i,c(firstBr:lastBr)] %in% value))!=0){
          br <- c(br, c(max(data[i,which(data[i,c(firstBr:lastBr)] %in% value)+1]))) 
        }
        else {
            br <- c(br, c(NA))
        }
    }
    result <- br
}

firstBr = 7
lastBr = 11

data$br_1 <- calc_br(data,1,firstBr,lastBr)
data$br_2 <- calc_br(data,2,firstBr,lastBr)
data$br_3 <- calc_br(data,3,firstBr,lastBr)
data$br_4 <- calc_br(data,4,firstBr,lastBr)
data$br_5 <- calc_br(data,5,firstBr,lastBr)
data$br_6 <- calc_br(data,6,firstBr,lastBr)

This should yield the same results as in your example. You should only have to exchange lastBr and firstBr (to 62 and 122 i would guess).

For the second part this should do the trick:

br_sum <- c()
for (i in 1:nrow(data)){
    br_sum <- c(br_sum, sum(data[i,lastBr+tail(order(data[i,c((lastBr+1):(lastBr+6))], na.last = NA), 3)]))
}
data$br_sum <- br_sum

For completness here my results:

  sub_id diag_1_ais diag_2_ais diag_3_ais diag_4_ais diag_5_ais diag_1_br
1      1          1          1          2          2          1         6
2      2          2          1          4          3          5         5
3      3          0          0         NA          4          1         0
4      4         NA         NA          1          2          2        NA
5      5         NA          4          2          3          5        NA
  diag_2_br diag_3_br diag_4_br diag_5_br br_1 br_2 br_3 br_4 br_5 br_6 br_sum
1         0         1         6         1    2   NA   NA   NA   NA    2      4
2         2         2         1         3    3    4    5   NA    2   NA     12
3         0        NA         2         2   NA    4   NA   NA   NA   NA      4
4        NA         1         1         4    2   NA   NA    2   NA   NA      4
5         4         3         4         3   NA   NA    5    4   NA   NA      9

Upvotes: 1

Anoushiravan R
Anoushiravan R

Reputation: 21918

I think you could use the following solution. I made a slight modification so that we only sum the first 3 max values:

library(dplyr)
library(purrr)

df %>%
  bind_cols(as.data.frame(t(map_dfr(1:6, function(a) pmap_dfc(df, ~ {x <- c(...)[grepl("br", names(df))]
  inds <- which(x == a) 
  if(length(inds) != 0) {
    y <- c(...)[grepl("ais", names(df))]
    max(y[inds])
  } else {
    NA
  }})))) %>%
    setNames(paste0("br", 1:6))) %>% 
  rowwise() %>%
  mutate(sum = sum(sort(as.numeric(c_across(starts_with("br"))), decreasing = TRUE)[1:3], na.rm = TRUE)) %>%
  select(starts_with("br"), sum)

Resulting output

# A tibble: 5 x 7
# Rowwise: 
  br1   br2   br3   br4   br5   br6     sum
  <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2     NA    NA    NA    NA    2         4
2 3     4     5     NA    2     NA       12
3 NA    4     NA    NA    NA    NA        4
4 2     NA    NA    2     NA    NA        4
5 NA    NA    5     4     NA    NA        9

Upvotes: 2

Related Questions