imprela
imprela

Reputation: 83

Delete columns in an R loop

I have a dataframe where I want to replace the variables

Then I'd like to delete the variables age1_corr_1, age1_corr_2, ..., age1_corr_n. I have figured out how to do the first part (change the values) in a loop but couldn't figure out how to delete the variables after. Any suggestion?

Sample data

y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
            "age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))

The code that will change values of age_n based on age1_corr_n

for(i in 1:4){
  cname1 <- paste0("age_",i)
  cname2 <- paste0("age1_corr_",i)
  y[,cname1] <- ifelse(!is.na(y[,cname2]), y[,cname2], y[,cname1])
}

The output I'd like to have is

  age_1 age_2 age_3 age_4
1     1     1     4     1
2     1     2     3     4
3     1    10     2     2
4     0     9     6     7       

Upvotes: 0

Views: 2568

Answers (3)

wibeasley
wibeasley

Reputation: 5287

Here are two other approaches that can handle an arbitrary number of columns. For this specific example dataset, make sure that the 4th column is correctly represented as a float with y$age1_corr_4 <- as.numeric(y$age1_corr_4)).

Like Dan Hall's response, one approach keeps the columns you want...

library(magrittr)
coalesce_corr1 <- function( index ) {
  name_age  <- paste0("age_"      , index)
  name_corr <- paste0("age1_corr_", index)
  y %>%
    dplyr::mutate(
      !!name_age   := dplyr::coalesce(.data[[name_corr]], .data[[name_age]])
    ) %>%
    dplyr::select(!!name_age)
}

1:4 %>% 
  purrr::map(coalesce_corr) %>% 
  dplyr::bind_cols()

...and the other drops the columns you don't want.

z <- y
coalesce_corr2 <- function( index ) {
  name_age  <- paste0( "age_"      , index)
  name_corr <- paste0( "age1_corr_", index)

  z <<- z %>%
    dplyr::mutate(
      !!name_age   := dplyr::coalesce(.data[[!!name_corr]], .data[[!!name_age]])
    )
  z[[name_corr]] <<- NULL
}

1:4 %>% 
  purrr::walk(coalesce_corr2)
z

I wish this last one didn't require a global variable (that uses <<-), and for this reason, I actually recommend Dan's approaches, but I wanted to try out quosures for output variables.

Upvotes: 0

wibeasley
wibeasley

Reputation: 5287

Do the real work with dplyr::coalesce() (description: "Given a set of vectors, coalesce() finds the first non-missing value at each position."). Then drop the columns with dplyr::select(), using a negative sign in front of the columns you don't need anymore.

library(magrittr)
y %>% 
  dplyr::mutate(
    age1_corr_4     = as.numeric(age1_corr_4), # Delete this line if it's already a numeric/floating data type.
    age_1           = dplyr::coalesce(age1_corr_1, age_1),
    age_2           = dplyr::coalesce(age1_corr_2, age_2),
    age_3           = dplyr::coalesce(age1_corr_3, age_3),
    age_4           = dplyr::coalesce(age1_corr_4, age_4)
  ) %>% 
  dplyr::select(
    -age1_corr_1, -age1_corr_2, -age1_corr_3, -age1_corr_4
  )

Produces

  age_1 age_2 age_3 age_4
1     1     1     4     1
2     1     2     3     4
3     1    10     2     2
4     0     9     6     7

Edit: I apologize, I focused on the coalesce part of the task and ignored the n part of the task.

Upvotes: 1

De Novo
De Novo

Reputation: 7600

You have several options if there is a pattern to the columns you want to remove (or conversely, the ones you want to keep).

Here's the data you provided:

y <- data.frame("age_1" = c(5,1,1,10), "age1_corr_1" = c(1,NA,NA,0), "age_2" = c(1,2,3,4), "age1_corr_2" = c(NA, NA, 10, 9),
            "age_3" = c(4,3,2,5), "age1_corr_3" = c(NA,NA,NA,6), "age_4" = c(1,4,2,7), "age1_corr_4" = c(NA, NA, NA,NA))

Here's a dplyr example of how to get only those columns that follow the pattern age_N, where N is 1, 2, 3, or 4:

library(dplyr)
x <- select(y, paste("age", 1:4, sep = "_"))

Alternatively, you could choose the pattern for the columns you DON'T want:

x <- select(y, -grep("_corr_", current_vars()))

This uses the following strategy: * you can select for everything BUT a column or set of columns by adding a minus sign first. * current_vars() is a helper function in dplyr that evaluates to all the variable names for the data (here, y)

Upvotes: 1

Related Questions