daaronr
daaronr

Reputation: 517

Reconcile dataset *column types* (formats) using a dictionary/list in R/dplyr

Following on the renaming request #67453183 I want to do the same for formats using the dictionary, because it won't bring together columns of distinct types.

I have a series of data sets and a dictionary to bring these together. But I'm struggling to figure out how to automate this. > Suppose this data and dictionary (actual one is much longer, thus I want to automate):


mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)

dic <- tibble(true_name  = c("mpg_true", "cyl_true"), 
              nameA = c("mpgA", "cyl_A"), 
              nameB = c("mpg_B", "B_cyl"),
              true_format = c("factor", "numeric")
)

I want these datasets (from years A and B) appended to one another, and then to have the names changed or coalesced to the 'true_name' values.... I want to automate 'coalesce all columns with duplicate names'.

And to bring these together, the types need to be the same too. I'm giving the entire problem here because perhaps someone also has a better solution for 'using a data dictionary'.

@ronakShah in the previous query proposed

pmap(dic, ~setNames(..1, paste0(c(..2, ..3), collapse = '|'))) %>%
  flatten_chr() -> val

mtcars_all <- list(mtcarsA,mtcarsB) %>%
  map_df(function(x) x %>% rename_with(~str_replace_all(.x, val)))

Which works great in the previous example but not if the formats vary. Here it throws error:

Error: Can't combine ..1$cyl_true<double> and..2$cyl_true <factor<51fac>>.

This response to #56773354 offers a related solution if one has a complete list of types, but not for a type list by column name, as I have.

Desired output:

mtcars_all
# A tibble: 4 x 3

mpg_true cyl_true  disp
  <factor> <numeric> <dbl>
1    21     6   160
2    21     6   160
3    22.8   4   108
4    21.4   6   258

Upvotes: 1

Views: 139

Answers (2)

dash2
dash2

Reputation: 2262

Something simpler:

library(magrittr) # %<>% is cool
library(dplyr)

# The renaming is easy:

renameA <- dic$nameA
renameB <- dic$nameB
names(renameA) <- dic$true_name
names(renameB) <- dic$true_name

mtcarsA %<>% rename(all_of(renameA))
mtcarsB %<>% rename(all_of(renameB))

# Formatting is a little harder:

formats <- dic$true_format
names(formats) <- dic$true_name

lapply(names(formats), function (x) {
  # there's no nice programmatic way to do this, I think
  coercer <- switch(formats[[x]], 
                      factor = as.factor,
                      numeric = as.numeric,
                      warning("Unrecognized format") 
                    )
  mtcarsA[[x]] <<- coercer(mtcarsA[[x]])
  mtcarsB[[x]] <<- coercer(mtcarsB[[x]])
})

mtcars_all <- bind_rows(mtcarsA, mtcarsB)

In the background you should be aware of how base R treated concatenating factors before 4.1.0, and how this'll change. Here it probably doesn't matter because bind_rows will use the vctrs package.

Upvotes: 2

Dan Chaltiel
Dan Chaltiel

Reputation: 8484

I took another approach than Ronak's to read the dictionary. It is more verbose but I find it a bit more readable. A benchmark would be interesting to see which one is faster ;-)

Unfortunately, it seems that you cannot blindly cast a variable to a factor so I switched to character instead. In practice, it should behave exactly like a factor and you can call as_factor() on the end object if this is very important to you. Another possibility would be to store a casting function name (such as as_factor()) in the dictionary, retrieve it using get() and use it instead of as().

library(tidyverse)

mtcarsA <- mtcars[1:2,1:3] %>% rename(mpgA = mpg, cyl_A = cyl) %>% as_tibble()
mtcarsB <- mtcars[3:4,1:3] %>% rename(mpg_B = mpg, B_cyl = cyl) %>% as_tibble()
mtcarsB$B_cyl <- as.factor(mtcarsB$B_cyl)

dic <- tibble(true_name  = c("mpg_true", "cyl_true"), 
              nameA = c("mpgA", "cyl_A"), 
              nameB = c("mpg_B", "B_cyl"),
              true_format = c("numeric", "character") #instead of factor
)

dic2 = dic %>% 
  pivot_longer(-c(true_name, true_format), names_to=NULL)

read_dic = function(key, dict=dic2){
  x = dict[dict$value==key,][["true_name"]]
  if(length(x)!=1) x=key
  x
}

rename_from_dic = function(df, dict=dic2){
  rename_with(df, ~{
    map_chr(.x, ~read_dic(.x, dict))
  })
}

cast_from_dic = function(df, dict=dic){
  mutate(df, across(everything(), ~{
    cl=dict[dict$true_name==cur_column(),][["true_format"]]
    if(length(cl)!=1) cl=class(.x)
    as(.x, cl, strict=FALSE)
  }))
}

list(mtcarsA,mtcarsB) %>% 
  map(rename_from_dic) %>% 
  map_df(cast_from_dic)
#> # A tibble: 4 x 3
#>   mpg_true cyl_true  disp
#>      <dbl> <chr>    <dbl>
#> 1     21   6          160
#> 2     21   6          160
#> 3     22.8 4          108
#> 4     21.4 6          258

Created on 2021-05-09 by the reprex package (v2.0.0)

Upvotes: 1

Related Questions