A standard way of creating a new variable based on other variables and other rows

Question

I know several ways of creating new variables, but which one is the most idiomatic in R?

I usually use a loop because it's the easiest to write, but it's probably slower than other approaches.

countries <- c("USA", "GER", "POL", "UK")
years <- c(2014, 2015, 2016, 2017, 2018, 2019)
var.value <- runif(length(countries) * length(years), min = 1, max = 100)

our.data.frame <- merge(countries, years, all = TRUE)
our.data.frame <- cbind(our.data.frame, var.value)
colnames(our.data.frame) <- c("Country", "Year", "Value")

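# Suppose we want to create a variable that holds the sum of "Value"
# for the given year and the next year, within the same country
produce.new.var <- function(our.data.frame) {
  # Preallocate the result instead of growing it inside the loop
  new.var <- numeric(nrow(our.data.frame))

  # seq_len() is safe even when the data frame has zero rows
  for(i in seq_len(nrow(our.data.frame))) {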
    next.year.i <- which(
      our.data.frame$Country == our.data.frame$Country[i]
      & our.data.frame$Year == our.data.frame$Year[i] + 1
    )

    if(length(next.year.i) == 0) {
      new.var[i] <- our.data.frame$Value[i]
    } else {
      new.var[i] <- our.data.frame$Value[i] + our.data.frame$Value[next.year.i]
    }
  }

  new.var
}

our.data.frame <- cbind(our.data.frame, NewVar = produce.new.var(our.data.frame))

This is also convenient because the new variable comes out in the same row order as the data frame, so attaching it with cbind() is easy. But I feel I should do it using some vectorisation, or at least using which()... The trouble is that, written that way, computing the new variable and attaching it to the data frame no longer seems simple. I'm surely missing something.
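For illustration, here is a minimal sketch of what a vectorised version might look like in base R, using match() on a combined Country/Year key. The name df and the use of expand.grid() and set.seed() are my own assumptions for the sketch, not part of the code above, and it assumes each (Country, Year) pair appears at most once.

```r
# Rebuild the example data. expand.grid() is an assumption made for this
# sketch; it produces the same Country x Year grid as the merge() call.
set.seed(1)
df <- expand.grid(Country = c("USA", "GER", "POL", "UK"),
                  Year = 2014:2019,
                  stringsAsFactors = FALSE)
df$Value <- runif(nrow(df), min = 1, max = 100)

# One key per row, plus the key of the row we want to look up
# (same country, following year).
key      <- paste(df$Country, df$Year)
next.key <- paste(df$Country, df$Year + 1)

# match() gives the row index of the next-year row, or NA if there is none.
idx <- match(next.key, key)

# Add next year's value where it exists, otherwise add nothing.
df$NewVar <- df$Value + ifelse(is.na(idx), 0, df$Value[idx])
```

Because match() returns one index per row, the result is already in the row order of the data frame, so it can be assigned as a column directly and no cbind() step is needed.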

By the way, I usually work on large data sets, with between 1,000 and 1,000,000 rows and usually about 10-30 columns. That may matter.

Edit: I would be interested in a solution in base R, without (for example) dplyr.
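To make the base R constraint concrete, one possible shape for such a solution is to sort by country and year, compare each row with the row after it, and then write the result back in the original row order. This is only a sketch under assumptions: df, ord, has.next and the other names are hypothetical, and each (Country, Year) pair is assumed to be unique.

```r
# Example data, rebuilt with expand.grid() (an assumption of this sketch).
set.seed(1)
df <- expand.grid(Country = c("USA", "GER", "POL", "UK"),
                  Year = 2014:2019,
                  stringsAsFactors = FALSE)
df$Value <- runif(nrow(df), min = 1, max = 100)

# Sort so that consecutive years of one country sit on adjacent rows.
ord     <- order(df$Country, df$Year)
country <- df$Country[ord]
year    <- df$Year[ord]
value   <- df$Value[ord]
n       <- length(value)

# TRUE where the following sorted row is the same country, one year later.
# The last row has no following row, hence the trailing FALSE.
has.next   <- c(country[-n] == country[-1] & year[-n] + 1 == year[-1], FALSE)
next.value <- c(value[-1], 0)

# Compute in sorted order, then scatter back to the original row order.
new.var      <- numeric(n)
new.var[ord] <- value + ifelse(has.next, next.value, 0)
df$NewVar    <- new.var
```

Only order(), vector indexing and ifelse() are used here, so nothing outside base R is required; whether this is faster than the loop on a given data set would need to be measured, for example with system.time().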
