Reputation: 1321
I've got a dataframe that has monthly survey scores for a certain hospitals. Each month, we store the score obtained by the hospital (_Score
column) and the corresponding average score for all hospitals for that month (_Average
column).
Here's a short sample of what it looks like -
df = data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
Question=c(rep("Q1",40)),
key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
"2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
"2020-05-31_Average","2020-05-31_Score"),4)),
value=c(round(runif(40,0,1),2)))
library(tidyr)
df = df %>% spread(key,value)
I would like to transform this dataframe such that -
1) The first two columns, Hospital
and Question
remain the same
2) _Score
columns for the three most recent months only are kept
3) _Average
column for the most recent month is kept
4) Ideally, the columns need to be reordered from oldest to most recent (i.e. in the following order: Month M-2_Score
, Month M-1_Score
, Month M_Score
, Month M_Average
)
5) Calculate a column Variance
at the end, which is the difference between Score M and Score M-1
What I'm trying to achieve
Using dplyr, this can be done manually by reordering the columns. But I'm looking for a way to build a logic that automatically reorders columns for the 3 most recent months in the sequence described above. By taking the date values embedded in the column names and reordering according to them.
The resulting table would look like this -
#Final table
df_transformed = df %>%
select(1:2,8,10,12,11) %>%
mutate(Variance=.[[5]]-.[[4]])
Any tips on how to do this more efficiently using the date values in the column names would be highly appreciated.
Upvotes: 0
Views: 392
Reputation: 9247
This is a possible solution if the columns in your dataset are already ordered chronologically
# create vectors of variables: 3 last "_Score" and 1 last "_Average"
score_vars <- tail(names(df)[grep("_Score", names(df))], 3)
average_var <- tail(names(df)[grep("_Average", names(df))], 1)
df %>%
select(Hospital, Question, !!score_vars, !!average_var) %>%
mutate(Variance = !!rlang::sym(score_vars[3]) - !!rlang::sym(score_vars[2]))
Output
# Hospital Question 2020-03-31_Score 2020-04-30_Score 2020-05-31_Score 2020-05-31_Average Variance
# 1 Hospital A Q1 0.28 0.69 0.31 0.94 -0.38
# 2 Hospital B Q1 0.19 0.41 0.27 0.91 -0.14
# 3 Hospital C Q1 0.53 0.03 0.25 0.05 0.22
# 4 Hospital D Q1 0.43 0.59 0.46 0.36 -0.13
Upvotes: 1
Reputation: 388982
I have used your original df
in long format before the spread
step.
library(dplyr)
library(tidyr)
df %>%
#Bring date and key in separate columns
separate(key, c('Date', 'key'), sep = '_') %>%
#Convert date column to date class
mutate(Date = as.Date(Date)) %>%
#arrange data according with highest date first
arrange(Hospital, key, desc(Date)) %>%
#For each hospital and key
group_by(Hospital, key) %>%
#If it is a "score" column select top 3 values and
#for average column select only 1 value
slice(if(first(key) == 'Score') 1:3 else 1) %>%
select(-Question) %>%
ungroup() %>%
#Get the data in wide format
pivot_wider(names_from = c(key, Date), values_from = value) %>%
#Calculate variance column
mutate(Variance = .[[3]] - .[[4]])
# A tibble: 4 x 6
# Hospital `Average_2020-05-31` `Score_2020-05-31` `Score_2020-04-30` `Score_2020-03-31` Variance
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Hospital A 0.45 0.44 0.66 0.97 -0.22
#2 Hospital B 0.11 0.53 0.68 0.27 -0.15
#3 Hospital C 1 0.18 0.56 0.41 -0.38
#4 Hospital D 0.31 0.83 0.6 0.79 0.23
To calculate variance .[[3]] - .[[4]]
will be fixed because "Hospital"
column is fixed and would always be 1st column. "Average"
column would come before "Score"
column (alphabetically) and since the data is sorted by Date
, we know that highest date would be placed first, then second-highest and so on.
Upvotes: 0
Reputation: 8117
I don't really get questions 4 and 5, but they feel a little like "Can you do my homework, please?". For questions 1 to 3, consider this:
library(tidyverse)
library(lubridate)
df <- data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
Question=c(rep("Q1",40)),
key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
"2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
"2020-05-31_Average","2020-05-31_Score"),4)),
value=c(round(runif(40,0,1),2)))
# take the dataframe
df %>%
# get month as a date and key separately
mutate(month = str_replace(key, "_[[:alpha:]]*$", "") %>% ymd()
, key = str_extract(key, "[[:alpha:]]*$")) %>%
# filter Score for the last 3 and Average for the last 1 months
filter(!(today() - month > months(3) &
key == "Score")) %>%
filter(!(today() - month > months(1) &
key == "Average"))
Upvotes: 0