Biased_Observer
Biased_Observer

Reputation: 87

How to re-structure a table into a specific template format in R?

I have a dataset containing results from a survey. Let's pretend that a survey was sent out to thousands of employees belonging to a number of different companies, I processed the results of these surveys, identified some errors in those surveys and now want to send a custom error summary to each employee, so that they can correct those errors.

To send out these summaries, we use a software which allows you to send out a custom email, using a template where you can specify custom fields.

E.g.


Dear (Name),

We have identified a total of (number of errors) errors in the the surveys submitted by (company_name). Please find these below:

(error_1_description)

(error_1_survey_IDs)

(error_2_description)

(error_2_survey_IDs)

(error_3_description)

(error_3_survey_IDs)

(error_4_description)

(error_4_survey_IDs)


When sent, the recipient sees a summary specific to their company, e.g. :

Dear Steve,

We have identified a total of 20 errors in the the surveys submitted by Amazon. Please find these below:

Error in question 1. IDs of affected surveys:

00100A, 00100B, 00100C

Error in question 2. IDs of affected surveys:

00100A, 00100B

Error in question 3. IDs of affected surveys:

00100A

Error in question 4. IDs of affected surveys:

00100B, 00100C


My problem is that I need to re-structure the error summary into the template format accepted by the software, and I am struggling to find a way.

The table containing the error summary can be re-created using the code below:

error_summary <- structure(list(organisation = c("Amazon", "Amazon", "Amazon", 
"Amazon", "Amazon", "Amazon", "Amazon", "Amazon", "Amazon", "Google", 
"Google", "Google", "Google", "Google", "Google", "Google", "Google", 
"Google", "Google", "Google", "Google", "Google", "Google", "Facebook", 
"Facebook", "Facebook", "Facebook", "Facebook", "Facebook", "Facebook", 
"Facebook", "Facebook", "Facebook", "Facebook", "Facebook", "Facebook"
), questionnaire_id = c("00100A", "00100A", "00100A", "00100B", 
"00100C", "00100C", "00100C", "00100D", "00100D", "00100E", "00100E", 
"00100E", "00100F", "00100G", "00100G", "00100G", "00100H", "00100H", 
"00100H", "00100H", "00100H", "00100J", "00100J", "00100K", "00100K", 
"00100K", "00100K", "00100L", "00100L", "00100L", "00100L", "00100M", 
"00100M", "00100M", "00100M", "00100M"), error_message = c("error found in question 1", 
"error found in question 2", "error found in question 4", "error found in question 1", 
"error found in question 2", "error found in question 5", "error found in question 6", 
"error found in question 1", "error found in question 2", "error found in question 1", 
"error found in question 2", "error found in question 4", "error found in question 1", 
"error found in question 2", "error found in question 5", "error found in question 6", 
"error found in question 1", "error found in question 2", "error found in question 3", 
"error found in question 4", "error found in question 5", "error found in question 5", 
"error found in question 6", "error found in question 1", "error found in question 2", 
"error found in question 4", "error found in question 5", "error found in question 2", 
"error found in question 5", "error found in question 6", "error found in question 7", 
"error found in question 2", "error found in question 3", "error found in question 4", 
"error found in question 5", "error found in question 6")), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -36L), spec = structure(list(
    cols = list(organisation = structure(list(), class = c("collector_character", 
    "collector")), questionnaire_id = structure(list(), class = c("collector_character", 
    "collector")), error_message = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

This table contains multiple rows per company, and only three columns.

For the template, the data needs to be re-structured so that it contains only one row per company, and each error and list of survey IDs containing that error in their own separate columns.

For example, ideally the above would end up like the below, where each column corresponds to a custom field which can be specified in the text body of the template:

end_goal_template <- structure(list(organisation = c("Amazon", "Google", "Facebook"
), error_1 = c("error found in question 1", "error found in question 1", 
"error found in question 1"), error_1_survey_IDs = c("00100A 00100B 00100D", 
NA, "00100K"), error_2 = c("error found in question 2", "error found in question 2", 
"error found in question 2"), error_2_survey_IDs = c("00100A 00100C 00100D", 
NA, "00100K 00100L 00100M"), error_3 = c("error found in question 3", 
"error found in question 3", "error found in question 3"), error_3_survey_IDs = c(NA, 
NA, "00100M"), error_4 = c("error found in question 4", "error found in question 4", 
"error found in question 4"), error_4_survey_IDs = c("00100A", 
NA, "00100K 00100M"), error_5 = c("error found in question 5", 
"error found in question 5", "error found in question 5"), error_5_survey_IDs = c("00100C", 
NA, "00100K 00100L 00100M"), error_6 = c("error found in question 6", 
"error found in question 6", "error found in question 6"), error_6_survey_IDs = c("00100C", 
NA, "00100L 00100M"), error_7 = c("error found in question 7", 
"error found in question 7", "error found in question 7"), error_7_survey_IDs = c(NA, 
NA, "00100L")), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), spec = structure(list(cols = list(
    organisation = structure(list(), class = c("collector_character", 
    "collector")), error_1 = structure(list(), class = c("collector_character", 
    "collector")), error_1_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_2 = structure(list(), class = c("collector_character", 
    "collector")), error_2_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_3 = structure(list(), class = c("collector_character", 
    "collector")), error_3_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_4 = structure(list(), class = c("collector_character", 
    "collector")), error_4_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_5 = structure(list(), class = c("collector_character", 
    "collector")), error_5_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_6 = structure(list(), class = c("collector_character", 
    "collector")), error_6_survey_IDs = structure(list(), class = c("collector_character", 
    "collector")), error_7 = structure(list(), class = c("collector_character", 
    "collector")), error_7_survey_IDs = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1), class = "col_spec"))

By having all the information in a single row, the software can substitute the information in the template for the custom fields, hence resulting in a custom email (there would also be a email field in the template but I excluded this for demo purposes).

I have wrestled with this all day, mostly using tidyr functions like pivot_wider but I feel like this is beyond my current abilities. Would be hugely grateful for any pointers or ideas, please and thank you!

Upvotes: 0

Views: 45

Answers (2)

user12728748
user12728748

Reputation: 8506

Here is a data.table solution:

library(data.table)
end_goal <- dcast(data.table(error_summary), organisation ~ error_message, 
    value.var="questionnaire_id", fun.aggregate = paste, collapse=", ", fill=NA)
setnames(end_goal, 
    sub("(error found in question )(.*)", "error_\\2_survey_IDs", colnames(end_goal)))
cn <- colnames(end_goal)[-1]
end_goal[,(sub("_survey_IDs", "", cn)):=data.table(t(cn))]
setcolorder(end_goal,  c("organisation", 
    as.vector(t(matrix(colnames(end_goal)[-1], ncol=2)[, 2:1]))))[]

Upvotes: 1

A. S. K.
A. S. K.

Reputation: 2816

Here's a pivot_wider solution. The columns aren't in the same order as your template (and don't quite have the same names), but this ought to get you 90% of the way there.

library(tidyverse)
error_summary %>%
  group_by(organisation, error_message) %>%
  summarise(survey_IDs = paste(questionnaire_id, collapse = " ")) %>%
  ungroup() %>%
  mutate(error = gsub(" found in question ", "_", error_message)) %>%
  rename(message = error_message) %>%
  group_by(organisation) %>%
  pivot_wider(id_cols = "organisation", names_from = error,
              values_from = c(message, survey_IDs),
              names_glue = "{error}_{.value}")

Upvotes: 2

Related Questions