sa-al
sa-al

Reputation: 5

Fill values into column of data frame based on two conditions

I have a dataframe where I would like to fill in values from column "reference column" into the column "trial_number" based on the following two conditions:

The values should only be filled into column "trial_number" if both conditions are met at the same time.

One complication is that the columns "start_time" and "end_time" contain more values than columns "a", "b" and "reference_column". The idea is that each of the values in "reference_column" is filled in multiple rows of column "trial_number".

I as well have the columns "a", "b" and "reference_column" in a separate dataframe.

A simplified version of the dataframe looks like this:

 reference_column  a  b trial_number start_time end_time
1                 1 10 20           NA         12       18
2                 2 22 24           NA         22       23
3                 3 30 40           NA         23       24
4                 4 45 55           NA         31       32
5                NA NA NA           NA         35       36
6                NA NA NA           NA         36       37
7                NA NA NA           NA         37       39
8                NA NA NA           NA         46       51
9                NA NA NA           NA         47       53
10               NA NA NA           NA         50       54

In the end, the desired output (filled values in "trial_number" should look like this:

 reference_column  a  b trial_number start_time end_time
1                 1 10 20           01         12       18
2                 2 22 24           02         22       23
3                 3 30 40           02         23       24
4                 4 45 55           03         31       32
5                NA NA NA           03         35       36
6                NA NA NA           03         36       37
7                NA NA NA           03         37       39
8                NA NA NA           04         46       51
9                NA NA NA           04         47       53
10               NA NA NA           04         50       54

Since I am quite new to R, I would be very happy about any recommendations. I already tried to use a for loop in combination with an if-statement, but wasn't successful because of the unbalanced number of data entries and various conditions. Thanks a lot in advance!

Upvotes: 0

Views: 74

Answers (2)

Susan Switzer
Susan Switzer

Reputation: 1922

Perhaps this approach would be helpful?

Dataset:

library(tidyverse)

df <- data.frame(referenceColumn = seq(from =1, to = 10, by =1), 
                 a = c(10, 22, 30, 45, NA, NA, NA, NA, NA, NA), 
                 b = c(20, 24, 40, 55, NA, NA, NA, NA, NA, NA), 
                 trialNumber = rep(NA, 10), 
                 startTime = c(12, 22, 23, 31, 35, 36, 37, 46, 47, 50), 
                 endTime = c(18, 23, 24, 32, 36, 37, 39, 51, 53, 54))

Separate your reference table and dataset



dimDf <- df %>% 
  select(referenceColumn:b) 
              

factDf <- df %>% 
  select(trialNumber:endTime)

Use your reference table in a case_when()

dfNew <- factDf %>% 
  mutate(trialNumber = case_when(
    startTime >= dimDf$a[1] & endTime <= dimDf$b[1] ~ dimDf$referenceColumn[1], 
    startTime >= dimDf$a[2] & endTime <= dimDf$b[2] ~ dimDf$referenceColumn[2], 
    startTime >= dimDf$a[3] & endTime <= dimDf$b[3] ~ dimDf$referenceColumn[3], 
    startTime >= dimDf$a[4] & endTime <= dimDf$b[4] ~ dimDf$referenceColumn[4],
    TRUE ~ NA_real_
  ))

example

or keep all columns

dfNew <- df %>% 
  mutate(trialNumber = case_when(
    startTime >= dimDf$a[1] & endTime <= dimDf$b[1] ~ dimDf$referenceColumn[1], 
    startTime >= dimDf$a[2] & endTime <= dimDf$b[2] ~ dimDf$referenceColumn[2], 
    startTime >= dimDf$a[3] & endTime <= dimDf$b[3] ~ dimDf$referenceColumn[3], 
    startTime >= dimDf$a[4] & endTime <= dimDf$b[4] ~ dimDf$referenceColumn[4],
    TRUE ~ NA_real_
  ))

example 2

Upvotes: 1

ScottyJ
ScottyJ

Reputation: 1087

My answer is going to be incomplete because I don't fully understand what you are doing with this statement:

One complication is that the columns "start_time" and "end_time" contain more values than columns "a", "b" and "reference_column". The idea is that each of the values in "reference_column" is filled in multiple rows of column "trial_number".

But I think I have some directional help:

library(tidyverse)

df <- tribble(~reference_column,  ~a,  ~b, ~trial_number, ~start_time, ~end_time,
               1, 10, 20,           NA,         12,       18,
               2, 22, 24,           NA,         22,       23,
               3, 30, 40,           NA,         23,       24,
               4, 45, 55,           NA,         31,       32,
              NA, NA, NA,           NA,         35,       36,
              NA, NA, NA,           NA,         36,       37,
              NA, NA, NA,           NA,         37,       39,
              NA, NA, NA,           NA,         46,       51,
              NA, NA, NA,           NA,         47,       53,
              NA, NA, NA,           NA,         50,       54)


df %>% 
  mutate(trial_number = if_else(start_time >= a & end_time <= b, 
                                reference_column,
                                trial_number))

# A tibble: 10 × 6
   reference_column     a     b trial_number start_time end_time
              <dbl> <dbl> <dbl>        <dbl>      <dbl>    <dbl>
 1                1    10    20            1         12       18
 2                2    22    24            2         22       23
 3                3    30    40           NA         23       24
 4                4    45    55           NA         31       32
 5               NA    NA    NA           NA         35       36
 6               NA    NA    NA           NA         36       37
 7               NA    NA    NA           NA         37       39
 8               NA    NA    NA           NA         46       51
 9               NA    NA    NA           NA         47       53
10               NA    NA    NA           NA         50       54

In your example output, you find some way to put values for "trial_number" even where df$a and df$b are == NA. Please either explain the additional logic in your question, or hopefully this is enough to give you some ideas how to proceed.

Upvotes: 0

Related Questions