Varun
Varun

Reputation: 1321

R Using lag() to create new columns in dataframe

I have a dataset with survey score results for 3 hospitals over a number of years. This survey contains 2 questions.

The dataset looks like this -

set.seed(1234)
library(dplyr)
library(tidyr)

dataset= data.frame(Hospital=c(rep('A',10),rep('B',8),rep('C',6)),
                    YearN=c(2015,2016,2017,2018,2019,
                           2015,2016,2017,2018,2019,
                           2015,2016,2017,2018,
                           2015,2016,2017,2018,
                           2015,2016,2017,
                           2015,2016,2017),
                    Question=c(rep('Overall Satisfaction',5),
                               rep('Overall Cleanliness',5),
                               rep('Overall Satisfaction',4),
                               rep('Overall Cleanliness',4),
                               rep('Overall Satisfaction',3),
                               rep('Overall Cleanliness',3)),
                    ScoreYearN=c(rep(runif(24,min = 0.6,max = 1))),
                    TotalYearN=c(rep(round(runif(24,min = 1000,max = 5000),0))))

MY OBJECTIVE

To add two columns to the dataset such that -

  1. The first column contains the score for the given question in the given hospital for the previous year
  2. The second column contains the total number of respondents for the given question in the given hospital for the previous year

MY ATTEMPT

I called the first column ScoreYearN-1 and the second column TotalYearN-1 I used the lag function to create the new columns that contain the lagged values from the existing columns.

library(dplyr)
library(tidyr)

dataset$`ScoreYearN-1`=lag(dataset$ScoreYearN)
dataset$`TotalYearN-1`=lag(dataset$TotalYearN)

Which gives me a resulting dataset where I have the desired outcome for the first five rows only (these rows correspond to the first Hospital-Question combination).

The remaining rows do not account for this grouping, and hence the 2015 'N-1' values take on the values of the previous group.

I'm not sure this is the best way to go about this problem. If you have any better suggestions, I'm happy to consider them.

Any help will be greatly appreciated.

Upvotes: 1

Views: 1050

Answers (1)

rfortin
rfortin

Reputation: 184

You're close! Just use dplyr to group by hospital

dataset_lagged <- dataset %>%
    group_by(Hospital,Question) %>%
    mutate(`ScoreYearN-1` = lag(ScoreYearN),
           `TotalYearN-1` = lag(TotalYearN))

Upvotes: 3

Related Questions