user13069688

Reputation: 353

How to create a loop with multiple conditions for censoring individuals

I have the following dataset:

# Create the dataset
set.seed(123)  # for reproducibility
new_dataset <- data.frame(
  ID = sample(1000:9999, 10, replace = TRUE)  # Generate ID
)

# Create variables for each year from 2011 to 2017 and assign random numbers from 0 to 5
for (year in 2011:2017) {
  weeks_in_year <- if (year %% 4 == 0) 53 else 52  # Simplification: 53 weeks in years divisible by 4, else 52

  # Create a matrix for each year with random numbers
  year_matrix <- matrix(sample(0:5, 10 * weeks_in_year, replace = TRUE), ncol = weeks_in_year)

  # Assign the matrix to a new variable with the appropriate column names
  colnames(year_matrix) <- paste0("y_", substr(year, 3, 4), sprintf("%02d", 1:weeks_in_year))
  new_dataset[colnames(year_matrix)] <- year_matrix
}

# Determine the number of cells to set to blank (70% of total cells)
total_cells <- nrow(new_dataset) * ncol(new_dataset)
cells_to_set_blank <- round(0.7 * total_cells)

# Randomly select cells to set to blank
cells_to_modify <- sample(1:total_cells, cells_to_set_blank, replace = FALSE)
rows_to_modify <- (cells_to_modify - 1) %/% ncol(new_dataset) + 1
cols_to_modify <- (cells_to_modify - 1) %% ncol(new_dataset) + 1

# Set selected cells to missing (NA) in the new data frame
for (i in 1:length(cells_to_modify)) {
  new_dataset[rows_to_modify[i], cols_to_modify[i] + 1] <- NA  # +1 to account for 'ID' column
}

# Add start_of_follow_up and end_of_follow_up columns
new_dataset$start_of_follow_up <- c(1146, 1247, 1348, 1449, 1150, 1151, 1150, 1150, 1150, 1150)
new_dataset$end_of_follow_up   <- c(1248, 1249, 1352, 1552, 1252, 1312, 1205, 1305, 1305, 1207)

These data come from a register and cover 2011 to 2017.

The variable ID identifies the individual; there is one individual per row.

The variables that start with y_ encode dates in year-week format. For example, y_1148 means year 2011, week 48, and y_1207 means year 2012, week 07. Together, these variables give the weekly follow-up in the register.
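
To make the naming scheme concrete, here is a minimal base R sketch that splits such a column name into its year and week parts. The helper name `decode_week` is hypothetical, and it assumes every name has the form `y_` followed by exactly four digits, with all years in the 2000s.

```r
# Hypothetical helper: split names like "y_1148" into year and week.
# Assumes the form y_YYWW and that YY refers to a year in the 2000s.
decode_week <- function(col_names) {
  digits <- sub("^y_", "", col_names)  # "y_1148" -> "1148"
  data.frame(
    column = col_names,
    year   = 2000L + as.integer(substr(digits, 1, 2)),
    week   = as.integer(substr(digits, 3, 4)),
    stringsAsFactors = FALSE
  )
}

decoded <- decode_week(c("y_1148", "y_1207"))
decoded
```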

These variables hold codes from 0 to 5, each indicating a reason for censoring the individual. Some cells are empty, which means the person had no code that week.

I also have two other variables, start_of_follow_up and end_of_follow_up, which give the year and week at which follow-up in the register should start and end for each individual. For example, for the first ID follow-up should start at 1146 (year 2011, week 46) and end at 1248 (year 2012, week 48). These dates should be matched against the y_ variables: for the first individual I want to search the range y_1146 to y_1248 for any of the codes listed further down, and likewise for all the other individuals.
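
As a small illustration of that matching, the sketch below picks out the y_ columns that fall inside one individual's follow-up window. The toy data frame and its values are made up; it relies on the assumption that the four digits after `y_` can be compared as plain numbers, which holds as long as all years are in the same century.

```r
# Toy data with only a handful of weekly columns (values are made up).
toy <- data.frame(
  ID = 1L,
  y_1145 = 3, y_1146 = 1, y_1147 = NA, y_1201 = 5, y_1249 = 0,
  start_of_follow_up = 1146,
  end_of_follow_up = 1248
)

week_cols <- grep("^y_", names(toy), value = TRUE)
week_num  <- as.numeric(sub("^y_", "", week_cols))  # "y_1146" -> 1146

# Keep only the columns inside the follow-up window of row 1.
in_window <- week_num >= toy$start_of_follow_up[1] &
  week_num <= toy$end_of_follow_up[1]
window_cols <- week_cols[in_window]
window_cols
```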

So I want to write code that, for every individual, searches the y_ variables for the codes 0 to 5 within the date range defined by start_of_follow_up and end_of_follow_up.

The codes are 0, 1, 2, 4, or the code 5 two consecutive times. If any of these is found within the range from start to end of follow-up, I want to censor that observation at the first one found and create a new variable named censored_week that holds, for each observation, the name of the y_ variable in which the code 0, 1, 2, 4, or two consecutive 2s was found, so that I know in which week each individual was censored.

For example, the first individual has the code 1 in the variable y_1146, so censored_week should be 1146 for that individual, indicating the year and week of censoring. Follow-up should then stop. I also need another variable that indicates why the individual was censored, that is, the code that was found first during follow-up: 0, 1, 2, 4, or two consecutive 2s.

If none of these codes is found within an individual's range of weeks, censored_week should take the value of end_of_follow_up, indicating that the participant was not censored, and the other variable should state that the individual was not censored.
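
The rule described above can be sketched for a single individual as a small base R function. Everything here is illustrative: `first_censoring` is a hypothetical helper, it expects the weeks and codes already restricted to the follow-up window and in chronological order, and it places a repeated code at the second of the two consecutive weeks. The code that must appear twice in a row is left as the parameter `repeat_code` (default 5 here, an assumption).

```r
# Hypothetical helper: return the first censoring event for one individual.
# `weeks` and `codes` are parallel vectors covering the follow-up window only.
first_censoring <- function(weeks, codes, end_week,
                            single_codes = c(0, 1, 2, 4), repeat_code = 5) {
  prev <- c(NA, head(codes, -1))  # code in the preceding week
  hit_single <- !is.na(codes) & codes %in% single_codes
  hit_repeat <- !is.na(codes) & !is.na(prev) &
    codes == repeat_code & prev == repeat_code
  hit <- which(hit_single | hit_repeat)
  if (length(hit) == 0) {
    # No qualifying code: follow-up runs to its planned end.
    return(list(censored_week = end_week, reason = "not censored"))
  }
  i <- hit[1]  # earliest qualifying week
  reason <- if (hit_single[i]) {
    as.character(codes[i])
  } else {
    paste0(repeat_code, repeat_code)
  }
  list(censored_week = weeks[i], reason = reason)
}

weeks <- c(1146, 1147, 1148, 1149)
res_a <- first_censoring(weeks, c(NA, 5, 5, 1), end_week = 1248)  # repeated code comes first
res_b <- first_censoring(weeks, c(3, NA, 5, 3), end_week = 1248)  # nothing qualifies
```

Applied row by row (for example inside a `for` loop over `seq_len(nrow(new_dataset))`), the two list elements would fill the two new variables.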

Hope someone can help me with this huge task :)

Thanks in advance

Upvotes: 0

Views: 70

Answers (1)

DaveArmstrong

Reputation: 21982

I would do this by pivoting the data to longer format. Then you can use group_by(ID) and identify when the various codes happen. Then you can pick out the earliest code for each ID and then merge it back into the original data.

library(tidyr)
library(dplyr)
library(stringr)
censored_codes <- new_dataset %>%
  # One row per ID and week
  pivot_longer(starts_with("y_"), names_to = "week", values_to = "code") %>%
  # "y_1146" -> 1146, comparable with the follow-up bounds
  mutate(week_num = as.numeric(str_extract(week, "\\d+"))) %>%
  group_by(ID) %>%
  # Keep only weeks inside each individual's follow-up window
  filter(week_num >= start_of_follow_up & week_num <= end_of_follow_up) %>%
  # Record the week name wherever a censoring code occurs (NA otherwise)
  mutate(code0 = ifelse(code == 0, week, NA),
         code1 = ifelse(code == 1, week, NA),
         code2 = ifelse(code == 2, week, NA),
         code4 = ifelse(code == 4, week, NA),
         # code22: a 2 directly preceded by a 2 in the previous week
         code22 = ifelse(code == 2 & lag(code) == 2, week, NA)) %>%
  select(ID, code0:code22) %>%
  pivot_longer(-ID, names_to = "censored_code", values_to = "censored_week") %>%
  na.omit() %>%
  group_by(ID) %>%
  # Earliest censoring week per ID
  arrange(censored_week, .by_group = TRUE) %>%
  slice_head(n = 1)
censored_codes
#> # A tibble: 10 × 3
#> # Groups:   ID [10]
#>       ID censored_code censored_week
#>    <int> <chr>         <chr>        
#>  1  2841 code0         y_1150       
#>  2  3462 code1         y_1146       
#>  3  3510 code0         y_1248       
#>  4  3756 code4         y_1152       
#>  5  3985 code4         y_1452       
#>  6  4370 code1         y_1203       
#>  7  5760 code0         y_1201       
#>  8  6106 code2         y_1150       
#>  9  7745 code0         y_1204       
#> 10  9717 code0         y_1348

new_dataset <- left_join(new_dataset, censored_codes)
#> Joining with `by = join_by(ID)`
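
In this example every ID has a censoring code inside its window, but the question also asks what to do when none is found. After the join such rows would have NA in both new columns. A minimal sketch of filling them, using a made-up two-row stand-in for the joined data (the names `joined` and `filled` are just for illustration):

```r
library(dplyr)

# Toy stand-in for the joined data: ID 2 had no censoring code in its window,
# so the join left its two new columns as NA. All values here are made up.
joined <- tibble(
  ID = c(1L, 2L),
  end_of_follow_up = c(1248, 1249),
  censored_code = c("code1", NA),
  censored_week = c("y_1146", NA)
)

filled <- joined %>%
  mutate(
    # Strip the "y_" prefix so the week is stored as a number such as 1146.
    censored_week = as.numeric(sub("^y_", "", censored_week)),
    # Uncensored rows fall back to the planned end of follow-up.
    censored_week = coalesce(censored_week, end_of_follow_up),
    censored_code = coalesce(censored_code, "not censored")
  )
filled
```

`coalesce()` takes the first non-missing value at each position, so censored rows keep their own week and code.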

Without the data and something to check the output against, I can't say for sure that this will work in your situation, but something like the following should flag two consecutive values of 2 for females (assuming a gender column in which 0 codes female):

censored_codes2 <- new_dataset %>% 
  pivot_longer(starts_with("y_"), names_to="week", values_to = "code") %>% 
  mutate(week_num = as.numeric(str_extract(week, "\\d+"))) %>% 
  group_by(ID) %>% 
  filter(week_num >= start_of_follow_up & week_num <= end_of_follow_up) %>% 
  mutate(censored_week = ifelse(code == 2 & lag(code) == 2 & gender == 0, week, NA)) %>%
  select(ID, censored_week) %>% 
  arrange(censored_week, .by_group = TRUE) %>% 
  slice_head(n=1) 

Upvotes: 0
