cparmstrong

Reputation: 819

Loop, create new variable as function of existing variable with conditional

I have some data that contains 400+ columns and ~80 observations. I would like to use a for loop to go through each column and, if it contains the desired prefix exp_, I would like to create a new column which is that value divided by a reference column, stored as the same name but with a suffix _pp. I'd also like to do an else if with the other prefix rev_ but I think as long as I can get the first problem figured out I can solve the rest myself. Some example data is below:

exp_alpha     exp_bravo    rev_charlie     rev_delta     pupils
10            28           38              95            2
24            56           39              24            5
94            50           95              45            3
15            93           72              83            9
72            66           10              12            3
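
For anyone who wants to reproduce this, the sample above can be built as a data frame. This is only a sketch: the name test matches the code further down, and the values are copied straight from the table.

```r
# Sample data from the table above, stored as `test`
test <- data.frame(
  exp_alpha   = c(10, 24, 94, 15, 72),
  exp_bravo   = c(28, 56, 50, 93, 66),
  rev_charlie = c(38, 39, 95, 72, 10),
  rev_delta   = c(95, 24, 45, 83, 12),
  pupils      = c(2, 5, 3, 9, 3)
)

# Quick look at the structure: 5 observations, 5 numeric columns
str(test)
```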

The first time I tried it, the loop ran through properly but only stored the final column in which the if statement was true, rather than storing each column in which the if statement was true. I made some tweaks and lost that code but now have this which runs without error but doesn't modify the data frame at all.

for (i in colnames(test)) {
  if(grepl("exp_", colnames(test)[i])) {
    test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
  }
}

My understanding of what this is doing:

  1. loop through the vector of column names
  2. if the substring "exp_" is in the ith element of the colnames vector == TRUE
  3. create a new column in the data set which is the ith element of the colnames vector divided by the reference category (pupils), and with "_pp" appended at the end
  4. else do nothing

Since the code executes without error but does nothing, I imagine my problem is in the if() statement, but I can't figure out what I'm doing wrong. I also tried adding "==TRUE" in the if() statement, but that gave the same result.

Upvotes: 5

Views: 2937

Answers (3)

pogibas

Reputation: 28329

Vectorized solution:

Don't use a loop for that! You can vectorize the operation, which is faster than looping over columns. Here's how to do it:

# Extract column names
cNames <- colnames(test)
# Find columns whose names start with exp_
foo <- grep("^exp_", cNames)
# Divide by reference: ALL columns at the SAME time
# (drop = FALSE keeps a data frame even if only one column matches)
bar <- test[, foo, drop = FALSE] / test$pupils
# Append the _pp suffix: ALL columns at the SAME time
colnames(bar) <- paste(cNames[foo], "pp", sep = "_")
# Add to original dataset instead of iteratively appending
test <- cbind(test, bar)
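
As a rough check of the vectorized approach, here is a minimal self-contained sketch run on the sample data from the question. The names target and per_pupil are made up for illustration, and the _pp suffix follows what the question asks for.

```r
# Sample data from the question
test <- data.frame(
  exp_alpha   = c(10, 24, 94, 15, 72),
  exp_bravo   = c(28, 56, 50, 93, 66),
  rev_charlie = c(38, 39, 95, 72, 10),
  rev_delta   = c(95, 24, 45, 83, 12),
  pupils      = c(2, 5, 3, 9, 3)
)

# Names of the columns that start with exp_ (hypothetical helper variable)
target <- grep("^exp_", colnames(test), value = TRUE)

# Dividing a data frame by a vector of length nrow() works column by column
per_pupil <- test[target] / test$pupils
colnames(per_pupil) <- paste0(target, "_pp")

# Attach the new columns in one step and keep the result
test <- cbind(test, per_pupil)
print(colnames(test))
```

The rev_ columns are left untouched here; widening the pattern to "^(exp|rev)_" would pick them up too, assuming they should be divided by pupils in the same way.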

Upvotes: 1

Luke C

Reputation: 10301

As an alternative to @timfaber's answer, you can keep your first line the same but not treat i as an index:

for (i in colnames(test)) {
  if(grepl("exp_", i)) {
    print(i)
    test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
  }
}
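
To see why treating i as an index fails silently, here is a small sketch using a made-up vector of names (cols is hypothetical, standing in for colnames(test)). Indexing an unnamed character vector by a string returns NA, and grepl() returns FALSE for NA, so the if() body is skipped without any error.

```r
# Hypothetical stand-in for colnames(test)
cols <- c("exp_alpha", "exp_bravo", "pupils")

# In `for (i in cols)`, i is the name itself, not a position
i <- cols[1]

# Looking up an element called "exp_alpha" in an unnamed vector gives NA
by_name <- cols[i]
print(by_name)

# grepl() on NA is FALSE, so the original condition never fires
original_check <- grepl("exp_", by_name)
print(original_check)

# Testing the name directly gives the intended result
fixed_check <- grepl("exp_", i)
print(fixed_check)
```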

Upvotes: 2

timfaber

Reputation: 2070

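Almost correct. Your loop iterates over the column names themselves, so colnames(test)[i] looks up an element by name, returns NA, and the condition is never TRUE. Loop over the column positions instead:

for (i in seq_along(colnames(test))) {
  if (grepl("exp_", colnames(test)[i])) {
    # i is a position here, so look up the name before building the new one
    test[paste(colnames(test)[i], "pp", sep = "_")] <- test[i] / test$pupils
  }
}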
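
For the else if on the rev_ prefix mentioned in the question, a possible sketch is below. It loops over names rather than positions and uses startsWith() from base R. The question does not say what should happen to rev_ columns, so dividing them by pupils with the same _pp suffix is purely an assumption; the variable name nm is also made up.

```r
# Sample data from the question
test <- data.frame(
  exp_alpha   = c(10, 24, 94, 15, 72),
  exp_bravo   = c(28, 56, 50, 93, 66),
  rev_charlie = c(38, 39, 95, 72, 10),
  rev_delta   = c(95, 24, 45, 83, 12),
  pupils      = c(2, 5, 3, 9, 3)
)

# colnames(test) is evaluated once, so columns added inside the loop are not revisited
for (nm in colnames(test)) {
  if (startsWith(nm, "exp_")) {
    # Expenditure per pupil, stored under the original name plus _pp
    test[[paste0(nm, "_pp")]] <- test[[nm]] / test$pupils
  } else if (startsWith(nm, "rev_")) {
    # Assumption: revenue columns get the same per-pupil treatment
    test[[paste0(nm, "_pp")]] <- test[[nm]] / test$pupils
  }
}

print(colnames(test))
```

The two branches are identical only because the intended handling of rev_ is not stated; swap in whatever calculation that prefix actually needs.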

Upvotes: 3
