kd1978
kd1978

Reputation: 511

Use column names from vector in for loop in dplyr

this should probably be quite straightforward, but I am struggling to get it to work. I currently have a vector of column names:

columns <- c('product1', 'product2', 'product3', 'support4')

I now want to use dplyr in a for loop to mutate some columns, but I am struggling to make it recognize that it is a column name, not a variable.

for (col in columns) {
  cross.sell.val <- cross.sell.val %>%
    dplyr::mutate(col = ifelse(col == 6, 6, col)) %>%
    dplyr::mutate(col = ifelse(col == 5, 6, col))
}

Can I use %>% in these situations? Thanks..

Upvotes: 1

Views: 1424

Answers (1)

Mark Peterson
Mark Peterson

Reputation: 9570

You should be able to do this without using a for loop at all.

Because you didn't provide any data, I am going to use the builtin iris dataset. The top of it looks like:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

First, I am saving the columns to analyze:

columns <- names(iris)[1:4]

Then, use mutate_at for each column, along with that particular rule. In each, the . represents the vector for each column. Your example implies that the rules are the same for each column, though if that is not the case, you may need more flexibility here.

mod_iris <-
  iris %>%
  mutate_at(columns, funs(ifelse(. > 5, 6, .))) %>%
  mutate_at(columns, funs(ifelse(. < 1, 1, .)))

returns:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          6.0         3.5          1.4           1  setosa
2          4.9         3.0          1.4           1  setosa
3          4.7         3.2          1.3           1  setosa
4          4.6         3.1          1.5           1  setosa
5          5.0         3.6          1.4           1  setosa
6          6.0         3.9          1.7           1  setosa

If you wanted to, you could instead write a function to make all of your changes for the column. This could also allow you to set the cutoffs differently for each column. For example, you may want to set the bottom and top portions of the data to be equal to that threshold (to reign in outliers for some reason), or you may know that each variable uses a dummy value as a placeholder (and that value is different by column, but is always the most common value). You could easily add in any arbitrary rule of interest this way, and it gives you a bit more flexibility than chaining together separate rules (e.g., if you use the mean, the mean changes when you change some of the values).

An example function:

modColumns <- function(x){
  botThresh <- quantile(x, 0.25)
  topThresh <- quantile(x, 0.75)

  dummyVal <- as.numeric(names(sort(table(x)))[1])
  dummyReplace <- NA

  x <- ifelse(x < botThresh, botThresh, x)
  x <- ifelse(x > topThresh, topThresh, x)
  x <- ifelse(x == dummyVal, dummyReplace, x)

  return(x)
}

And in use:

iris %>%
  mutate_at(columns, modColumns) %>%
  head

returns:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.3          1.6         0.3  setosa
2          5.1         3.0          1.6         0.3  setosa
3          5.1         3.2          1.6         0.3  setosa
4          5.1         3.1          1.6         0.3  setosa
5          5.1         3.3          1.6         0.3  setosa
6          5.4         3.3          1.7         0.4  setosa

Upvotes: 2

Related Questions