Mary B.
Mary B.

Reputation: 125

R - Create new variable based on certain conditions

I have the following data frame df with longitudinal data in long format (see below). I would like to create a new variable called new.var based on some conditions. This new variable should be 1, if the first value of postLin for a person is > 0 but <= 1 or if the the value of postLin == 0 and the value of preLin == 0. There can only be a maximum of one 1 on new.var per person (ID).

df <- read.table(text=
"ID       preLin   postLin      
800057    -8.55    0               
800057    -6.34    0           
800057    -5.34    0           
800057    -4.34    0         
800057        0    0.33   
800119    -0.88    0  
800119        0    0           
800119        0    1       
834011     -4.1    0 
834011     -3.1    0   
834341        0    1.34 
834341        0    2.34   
834341        0    3.34   
834341        0    5.34    
834341        0    6.66  
800125        0    0
800125        0    2.14
897177    -0.33    0 
897177        0    0.67 
897177        0    1.67", header=TRUE)

First, I tried this code:

df$new.var1 <- ifelse(df$preLin == 0 & (df$postLin >= 0 & df$postLin <= 1), 1, 0)

However, for ID 800119 there will then be two 1s.

My expected outcome would look like this:

df_new <- read.table(text=
"ID       preLin   postLin    new.var  
800057    -8.55    0          0         
800057    -6.34    0          0  
800057    -5.34    0          0
800057    -4.34    0          0     
800057        0    0.33       1 
800119    -0.88    0          0 
800119        0    0          1  
800119        0    1          0
834011     -4.1    0          0 
834011     -3.1    0          0 
834341        0    1.34       0
834341        0    2.34       0
834341        0    3.34       0  
834341        0    5.34       0    
834341        0    6.66       0
800125        0    0          1
800125        0    2.14       0
897177    -0.33    0          0
897177        0    0.67       1
897177        0    1.67       0", header=TRUE) 

Does anyone know a solution for my issue? Many thanks in advance!

Upvotes: 0

Views: 98

Answers (2)

MarBlo
MarBlo

Reputation: 4514

Here is an approach which uses tidyverse. It groups your IDs and uses the filter you want. A little helper columns is build which finds the first occurence in postLin, it is removed later. mutatewithifelse applies your rules. Then the result is joined with the original DF and NAs are turned into zeros.

library(tidyverse)


df %>%
  group_by(ID) %>%
  dplyr::filter(postLin > 0 | (postLin == 0 & preLin == 0)) %>%
  dplyr::mutate(
    first = dplyr::first(postLin)
  ) %>%
  mutate(new.var = ifelse((postLin == first & postLin < 1), 1, 0)) %>%
  select(-c(first)) %>%
  right_join(df, by = c("ID", "preLin", "postLin")) %>%
  mutate(new.var = ifelse(is.na(new.var), 0, new.var)) %>%
  arrange(ID, preLin, postLin)
#> # A tibble: 20 x 4
#> # Groups:   ID [6]
#>        ID preLin postLin new.var
#>     <int>  <dbl>   <dbl>   <dbl>
#>  1 800057  -8.55    0          0
#>  2 800057  -6.34    0          0
#>  3 800057  -5.34    0          0
#>  4 800057  -4.34    0          0
#>  5 800057   0       0.33       1
#>  6 800119  -0.88    0          0
#>  7 800119   0       0          1
#>  8 800119   0       1          0
#>  9 800125   0       0          1
#> 10 800125   0       2.14       0
#> 11 834011  -4.1     0          0
#> 12 834011  -3.1     0          0
#> 13 834341   0       1.34       0
#> 14 834341   0       2.34       0
#> 15 834341   0       3.34       0
#> 16 834341   0       5.34       0
#> 17 834341   0       6.66       0
#> 18 897177  -0.33    0          0
#> 19 897177   0       0.67       1
#> 20 897177   0       1.67       0

Upvotes: 0

davidnortes
davidnortes

Reputation: 932

Your expected outcome comes from declaring the condition in ifelse as follows:

df$new.var1 <- ifelse((df$postLin > 0 & df$postLin < 1) | (df$preLin == 0 & df$postLin ==0), 1, 0)

value = 1 if df$postLin gets values between 0 and 1 (but not 0 and 1) or if both df$preLin and df$postLin are 0. Otherwise value = 0

Result:

       ID preLin postLin new.var1
1  800057  -8.55    0.00        0
2  800057  -6.34    0.00        0
3  800057  -5.34    0.00        0
4  800057  -4.34    0.00        0
5  800057   0.00    0.33        1
6  800119  -0.88    0.00        0
7  800119   0.00    0.00        1
8  800119   0.00    1.00        0
9  834011  -4.10    0.00        0
10 834011  -3.10    0.00        0
11 834341   0.00    1.34        0
12 834341   0.00    2.34        0
13 834341   0.00    3.34        0
14 834341   0.00    5.34        0
15 834341   0.00    6.66        0
16 800125   0.00    0.00        1
17 800125   0.00    2.14        0
18 897177  -0.33    0.00        0
19 897177   0.00    0.67        1
20 897177   0.00    1.67        0

Remark: keeping the condition as <=1 will result in an additional one in the third 800119, as @shirewoman2 said in her comment

Upvotes: 0

Related Questions