Varun
Varun

Reputation: 1321

R Rearrange columns in dataframe based on date values in column names

I've got a dataframe that has monthly survey scores for a certain hospitals. Each month, we store the score obtained by the hospital (_Score column) and the corresponding average score for all hospitals for that month (_Average column).

Here's a short sample of what it looks like -

df = data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                Question=c(rep("Q1",40)),
                key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                      "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                      "2020-05-31_Average","2020-05-31_Score"),4)),
                value=c(round(runif(40,0,1),2)))

library(tidyr)
df = df %>% spread(key,value)

I would like to transform this dataframe such that -

1) The first two columns, Hospital and Question remain the same

2) _Score columns for the three most recent months only are kept

3) _Average column for the most recent month is kept

4) Ideally, the columns need to be reordered from oldest to most recent (i.e. in the following order: Month M-2_Score, Month M-1_Score, Month M_Score, Month M_Average)

5) Calculate a column Variance at the end, which is the difference between Score M and Score M-1

What I'm trying to achieve

Using dplyr, this can be done manually by reordering the columns. But I'm looking for a way to build a logic that automatically reorders columns for the 3 most recent months in the sequence described above. By taking the date values embedded in the column names and reordering according to them.

The resulting table would look like this -

#Final table
df_transformed = df %>%
  select(1:2,8,10,12,11) %>%
  mutate(Variance=.[[5]]-.[[4]])

Any tips on how to do this more efficiently using the date values in the column names would be highly appreciated.

Upvotes: 0

Views: 392

Answers (3)

Ric S
Ric S

Reputation: 9247

This is a possible solution if the columns in your dataset are already ordered chronologically

# create vectors of variables: 3 last "_Score" and 1 last "_Average"
score_vars <- tail(names(df)[grep("_Score", names(df))], 3)
average_var <- tail(names(df)[grep("_Average", names(df))], 1)

df %>% 
  select(Hospital, Question, !!score_vars, !!average_var) %>% 
  mutate(Variance = !!rlang::sym(score_vars[3]) - !!rlang::sym(score_vars[2]))

Output

# Hospital Question 2020-03-31_Score 2020-04-30_Score 2020-05-31_Score 2020-05-31_Average Variance
# 1 Hospital A       Q1             0.28             0.69             0.31               0.94    -0.38
# 2 Hospital B       Q1             0.19             0.41             0.27               0.91    -0.14
# 3 Hospital C       Q1             0.53             0.03             0.25               0.05     0.22
# 4 Hospital D       Q1             0.43             0.59             0.46               0.36    -0.13

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388982

I have used your original df in long format before the spread step.

library(dplyr)
library(tidyr)

df %>%
  #Bring date and key in separate columns
  separate(key, c('Date', 'key'), sep = '_') %>%
  #Convert date column to date class
  mutate(Date = as.Date(Date)) %>%
  #arrange data according with highest date first
  arrange(Hospital, key, desc(Date)) %>%
  #For each hospital and key
  group_by(Hospital, key) %>%
  #If it is a "score" column select top 3 values and 
  #for average column select only 1 value
  slice(if(first(key) == 'Score') 1:3 else 1) %>%
  select(-Question) %>%
  ungroup() %>%
  #Get the data in wide format
  pivot_wider(names_from = c(key, Date), values_from = value) %>%
  #Calculate variance column
  mutate(Variance = .[[3]] - .[[4]])

# A tibble: 4 x 6
#  Hospital   `Average_2020-05-31` `Score_2020-05-31` `Score_2020-04-30` `Score_2020-03-31` Variance
#  <chr>                     <dbl>              <dbl>              <dbl>              <dbl>    <dbl>
#1 Hospital A                 0.45               0.44               0.66               0.97    -0.22
#2 Hospital B                 0.11               0.53               0.68               0.27    -0.15
#3 Hospital C                 1                  0.18               0.56               0.41    -0.38
#4 Hospital D                 0.31               0.83               0.6                0.79     0.23

To calculate variance .[[3]] - .[[4]] will be fixed because "Hospital" column is fixed and would always be 1st column. "Average" column would come before "Score" column (alphabetically) and since the data is sorted by Date, we know that highest date would be placed first, then second-highest and so on.

Upvotes: 0

Georgery
Georgery

Reputation: 8117

I don't really get questions 4 and 5, but they feel a little like "Can you do my homework, please?". For questions 1 to 3, consider this:

library(tidyverse)
library(lubridate)

df <- data.frame(Hospital=c(rep("Hospital A",10),rep("Hospital B",10),rep("Hospital C",10),rep("Hospital D",10)),
                Question=c(rep("Q1",40)),
                key=c(rep(c("2020-01-31_Average","2020-01-31_Score","2020-02-29_Average","2020-02-29_Score",
                      "2020-03-31_Average","2020-03-31_Score","2020-04-30_Average","2020-04-30_Score",
                      "2020-05-31_Average","2020-05-31_Score"),4)),
                value=c(round(runif(40,0,1),2)))

# take the dataframe
df %>%
    # get month as a date and key separately
    mutate(month = str_replace(key, "_[[:alpha:]]*$", "") %>% ymd()
           , key = str_extract(key, "[[:alpha:]]*$")) %>%
    # filter Score for the last 3 and Average for the last 1 months
    filter(!(today() - month > months(3) & 
                 key == "Score")) %>%
    filter(!(today() - month > months(1) &
                 key == "Average"))

Upvotes: 0

Related Questions