Imconfused

Reputation: 31

Confused about if statement and for loop in R

I have a data frame in R where one column is a factor with a few levels. I want to create a dummy variable for each level, but when I write a loop to do this I get an error.

For example, if the column has the levels a, b, and c, and I want a 0/1 dummy variable for each one, the code I have to create one of them is:

h = rep(0, nrow(data))
for (i in 1:nrow(data)) {
  if (data[,1] == "a") {
    h[i] = 1
  } else {
    h[i] = 0
  }
}
cbind(data, h)

This gives me the error message "the condition has length > 1 and only the first element will be used". I have seen advice elsewhere on this site to write my own function for problems like this and to avoid for loops, but I don't really understand a) how to solve this by writing a function (at least not immediately) or b) the benefit of using a function rather than a loop.

In the end I used ifelse to create each vector and then cbind to add it to the data frame, but an explanation would really be appreciated.

Upvotes: 0

Views: 4229

Answers (2)

Gregor Thomas

Reputation: 146164

Aakash is correct in pointing out the problem in your loop. Your test is

if (data[,1] == "a")

Since your test doesn't depend on i, it will be the same for every iteration. You could fix your loop like this:

h = rep(0, nrow(data))
for (i in 1:nrow(data)) {
  if (data[i, 1] == "a") {
    h[i] = 1
  } else {
    h[i] = 0
  }
}
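
To see why the original test fails, here is a minimal sketch on a made-up data frame (the name toy and its values are assumptions for illustration, not your data). It only compares the length of the two conditions. Note that older R versions gave a warning for a condition longer than one element, while R 4.2.0 and later stop with an error.

```r
# Hypothetical toy data; the column name and values are made up for illustration
toy = data.frame(x = factor(c("a", "b", "c", "a")))

whole_column = toy[, 1] == "a"   # compares every row at once
single_row   = toy[2, 1] == "a"  # compares row 2 only

length(whole_column)  # one value per row, so `if` cannot use it directly
length(single_row)    # a single value, which is what `if` expects
```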

We can simplify further: since h is initialized to 0, there is no need to set it to 0 in the else case, so we can drop that branch:

for (i in 1:nrow(data)) {
  if (data[i, 1] == "a") {
    h[i] = 1
  }
}

A more substantial improvement would be to introduce vectorization. This will speed up your code and is usually easier to write once you get the hang of it. if can only check a single condition, but ifelse is vectorized: it takes a vector of tests, a vector of "if true" results, and a vector of "if false" results, and combines them:

h = ifelse(data[, 1] == "a", 1, 0)

With this, there is no need to initialize h before the statement, and we could add it directly to a data frame:

data$h = ifelse(data[, 1] == "a", 1, 0)

In this case, your test and results are so simple that we can do even better.

data[, 1] == "a" ## run this and look at the output

The above code returns a logical vector of TRUE and FALSE values. If we run as.numeric() on it, TRUE values will be coerced to 1 and FALSE values to 0. So we can just do

data$h = as.numeric(data[, 1] == "a")

which will be even more efficient than ifelse.
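
As a quick check that the three approaches agree, here is a small sketch on a made-up data frame (toy and the h_* names are hypothetical, chosen only for this example):

```r
# Hypothetical toy data; values are made up for illustration
toy = data.frame(x = factor(c("a", "b", "c", "a")))

# 1. Corrected loop: test one row at a time
h_loop = rep(0, nrow(toy))
for (i in seq_len(nrow(toy))) {
  if (toy[i, 1] == "a") {
    h_loop[i] = 1
  }
}

# 2. Vectorized ifelse: test the whole column at once
h_ifelse = ifelse(toy[, 1] == "a", 1, 0)

# 3. Coerce the logical vector directly
h_numeric = as.numeric(toy[, 1] == "a")

# Both comparisons should report that the results match
identical(h_loop, h_ifelse)
identical(h_loop, h_numeric)
```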

This operation is so simple that there is no benefit in writing a function to do it.
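
If you want a dummy for every level rather than just "a", one possible sketch is a short loop over the levels that reuses the same vectorized comparison. The data frame toy and the is_ column prefix are assumptions made up for this example. Base R's model.matrix() is shown as an alternative that builds all the indicator columns in one call.

```r
# Hypothetical toy data; values are made up for illustration
toy = data.frame(x = factor(c("a", "b", "c", "a")))

# Loop over the levels (not the rows); each pass adds one 0/1 column
for (lev in levels(toy$x)) {
  toy[[paste0("is_", lev)]] = as.numeric(toy$x == lev)
}
toy

# Alternative: an indicator matrix with one column per level
# ("- 1" drops the intercept so every level gets its own column)
mm = model.matrix(~ x - 1, data = toy)
mm
```

Here the loop runs once per level, so it stays cheap even when the data frame has many rows.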

Upvotes: 1

aakash

Reputation: 143

Change if (data[,1] == "a") { to if (data[i,1] == "a") {

Upvotes: 2
