Reputation: 150
I know several ways of creating new variables, but which one is the most idiomatic in R?
I usually use a loop because it's the easiest to write, but it's probably slower than other approaches.
countries <- c("USA", "GER", "POL", "UK")
years <- c(2014, 2015, 2016, 2017, 2018, 2019)
var.value <- runif(length(countries) * length(years), min = 1, max = 100)
our.data.frame <- merge(countries, years, all = TRUE)
our.data.frame <- cbind(our.data.frame, var.value)
colnames(our.data.frame) <- c("Country", "Year", "Value")
# Suppose we want a new variable holding the sum of "Value"
# for the given year and the next year, within the same country
produce.new.var <- function(our.data.frame) {
  # Preallocate the result instead of growing it inside the loop
  new.var <- numeric(nrow(our.data.frame))
  for (i in seq_len(nrow(our.data.frame))) {
    # Row of the same country in the following year, if there is one
    next.year.i <- which(
      our.data.frame$Country == our.data.frame$Country[i]
      & our.data.frame$Year == our.data.frame$Year[i] + 1
    )
    if (length(next.year.i) == 0) {
      # No following year: keep the current value only
      new.var[i] <- our.data.frame$Value[i]
    } else {
      new.var[i] <- our.data.frame$Value[i] + our.data.frame$Value[next.year.i]
    }
  }
  new.var
}
our.data.frame <- cbind(our.data.frame, NewVar = produce.new.var(our.data.frame))
This is also convenient because the new variable comes out in the same row order as the data frame, so attaching it with cbind() is easy. But I feel I should be using vectorisation, or at least which(), and then writing the code and attaching the new variable to the data frame no longer seems simple. I'm surely missing something.
By the way, I usually work with large data sets, with between 1,000 and 1,000,000 rows and usually about 10-30 columns, so performance may matter.
Edit: I would be interested in a solution in base R, without (for example) dplyr.
Upvotes: 1
Views: 52
Reputation: 52907
Take a look at lead() and lag() from dplyr.
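As a rough illustration of what these two functions do with their default arguments, the same one-step shifts can be sketched in base R. The names shift.lead and shift.lag are made up here, and the sketch assumes a non-empty vector:

```r
# Base R sketch of a one-step lead and lag, padding the open end with NA
shift.lead <- function(x) c(x[-1], NA)          # each element sees the next one
shift.lag  <- function(x) c(NA, x[-length(x)])  # each element sees the previous one

x <- c(10, 20, 30)
shift.lead(x)  # 20 30 NA
shift.lag(x)   # NA 10 20
```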
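Here's one way to do what you're after:
library(dplyr)
our.data.frame %>%
  arrange(Year, Country) %>%
  group_by(Country) %>%
  # lead() takes the next row, so this assumes no gaps in Year within a country;
  # default = 0 keeps Value as-is for a country's last year (plain lead() gives NA)
  mutate(NewVar = Value + lead(Value, default = 0)) %>%
  ungroup()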
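Since the question asks for base R, here is a minimal sketch of a vectorised alternative without dplyr. It assumes each Country/Year pair appears at most once, and the helper name add.next.year.sum is made up for illustration. It looks up each row's next-year partner with match() and treats a missing next year as 0, as the loop in the question does:

```r
# Hypothetical helper (the name is illustrative): adds NewVar = Value plus the
# same country's Value in the following year, or Value alone if there is none.
# Assumes each Country/Year pair occurs at most once.
add.next.year.sum <- function(df) {
  key      <- paste(df$Country, df$Year)      # key of the current row
  next.key <- paste(df$Country, df$Year + 1)  # key of the row we are looking for
  next.i   <- match(next.key, key)            # NA where there is no next year

  next.value <- df$Value[next.i]
  next.value[is.na(next.i)] <- 0              # no next year: add nothing

  df$NewVar <- df$Value + next.value
  df
}

# Example data with the same shape as in the question
our.data.frame <- expand.grid(
  Country = c("USA", "GER", "POL", "UK"),
  Year = 2014:2019,
  stringsAsFactors = FALSE
)
our.data.frame$Value <- runif(nrow(our.data.frame), min = 1, max = 100)
our.data.frame <- add.next.year.sum(our.data.frame)
```

Row order is left untouched, so no sorting or cbind() step is needed, and gaps in Year are handled because the lookup is by key rather than by position.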
Upvotes: 2