basti41a
basti41a

Reputation: 193

R: How to change column type of a df in a list and combine list to df with bind_rows()

I´m having multiple dataframes/tibbles inside a list, like this: (in the real dataset it´s more like 500 df with 180 columns and 30 rows each)

df1 <- data.frame(Col_1 = c(0,0,0,0,0),
                  Col_2 = c(1,1,1,1,1),
                  Col_3 = c("text", "text", "text", "text", "text"))

df2 <- data.frame(Col_1 = c(0,0,0,0,0),
                  Col_2 = c(1,1,1,1,1),
                  Col_3 = c(2,2,2,2,2))
l <- list(df1, df2) 

The reason for this is, because I´m using readxl over multiple excel files. In principle those excel files/columns are the same, but some columns are imported as character or double. This is caused by user input.

In the end I want a big dataframe with all the df binded by bind_rows() or another function.

By simply using dplyr::bind_rows(l) there will be an error (Error: Can't combine `..1$Col_3` <character> and `..2$Col_3` <double>.), because I´m having different class types. To solve this problem, I´m using this approach:

l <- lapply(l, function(df) dplyr::mutate_at(df, vars(matches("Col_3")), as.character))

and afterwards this:

df <- dplyr::bind_rows(l)

which results in my desired df using this simple example.

BUT if I want to use the lapply function in my "real" dataset, this error always occurs:

Error: Can't transform a data frame with duplicate names.
  1. How could I enclose the problem? (I can´t share the dataset because of confidentiality reasons, but I couldn´t reproduce this error in the above mentioned example)

  2. Is there maybe a better way/function to convert this list into one df? Maybe convert automatically all column types to character (for now this would work, but this is of course not a good choice in the long run)

Upvotes: 3

Views: 3556

Answers (3)

Sam
Sam

Reputation: 518

library dplyr
library purr

I would first create a list of all the files to be combined

csv_files = list.files(path = (paste0(data_path,"folder_with_csvs/")), pattern = "csv$", full.names = TRUE)

Then choose from the following options (second is probably preferred). First option is to have every column as a character.

df_all <- map_dfr(.x = set_names(csv_files),
              .f = ~ read_csv(.x, col_types = cols(.default = "c")))

Second option is have the default parsing/guessing for all columns except your exception(s).

df_all <- map_dfr(.x = set_names(csv_files_new),
          .f = ~ read_csv(.x, col_types = cols(.default = "?",Col_3 = "c")))

Third option is it classify each column.

df_all <- map_dfr(.x = set_names(csv_files_new),
              .f = ~ read_csv(.x, col_types = cols(Col_1 = "c", #char
                                                   Col_2 = "d", #double
                                                   Col_3 = "c"))) #char

see p. 34 of http://www.hiercourse.com/docs/Working_in_the_Tidyverse.pdf for more details

Upvotes: 1

r2evans
r2evans

Reputation: 160447

Up front, the best fix for this is to specify the column class when importing the data. By far, the function I propose below should never be used when you have any semblance of control over the import process. The utility of this function is in the extreme times you need some normalization of classes.

normalize_attributes <- function(L) {
  all_nms <- unique(unlist(sapply(L, names)))
  L <- lapply(L, function(dat) {
    missing_nms <- setdiff(all_nms, names(dat))
    if (length(missing_nms)) dat[missing_nms] <- NA
    dat
  })
  first_nms <- names(L[[1]])

  reattrib_funcs <- setNames(vector("list", length(all_nms)), all_nms)

  LL_attr <- lapply(setNames(nm = all_nms),
                    function(nm) lapply(L, function(dat) attributes(dat[[nm]])))
  LL_first <- lapply(setNames(nm = all_nms),
                     function(nm) lapply(L, function(dat) dat[[nm]][1]))

  for (nm in all_nms) {
    haschr <- sapply(LL_first[[nm]], inherits, "character")
    hasnum <- sapply(LL_first[[nm]], inherits, "numeric")
    haspsx <- sapply(LL_first[[nm]], inherits, "POSIXt")
    hasdate <- sapply(LL_first[[nm]], inherits, "Date")
    # use psx if any present otherwise date
    hastime <- if (any(haspsx)) haspsx else hasdate

    if (any(haschr)) {
      # character wins all
      reattrib_funcs[[nm]] <- as.character
    } else if (any(hastime)) {
      # need to use attributes here; this allows up-conversion without
      # needing to specify the 'origin' (this might be a bug!)
      att <- LL_attr[[nm]][[ which.max(hastime) ]]
      reattrib_funcs[[nm]] <- substitute(function(vec) `attributes<-`(vec, att))
    } else if (any(hasnum)) {
      reattrib_funcs[[nm]] <- as.numeric
    } else {
      cls <- class(unlist(LL_first[[nm]]))
      reattrib_funcs[[nm]] <- substitute(function(vec) `class<-`(vec, cls))
    }
  }

  L <- lapply(L, function(dat) {
    dat[all_nms] <- Map(function(func, vec) eval(func)(vec),
                        reattrib_funcs[all_nms], dat[all_nms])
    dat
  })

}

Some cruel data, showing various combinations of data classes/types:

df1 <- data.frame(int_int = c(0L,0L,0L,0L,0L),
                  num_num = c(0,0,0,0,0),
                  num_int = c(0,0,0,0,0),
                  num_psx = c(0,0,0,0,0),
                  num_dat = c(0,0,0,0,0),
                  chr_num = c("text", "text", "text", "text", "text"),
                  psx_dat = rep(Sys.time(),5),
                  chr_mis = c(0,0,0,0,0))

df2 <- data.frame(int_int = c(1L,1L,1L,1L,1L),
                  num_num = c(1,1,1,1,1),
                  num_int = c(1L,1L,1L,1L,1L),
                  num_psx = rep(Sys.time(),5),
                  num_dat = rep(Sys.Date(),5),
                  psx_dat = rep(Sys.Date(),5),
                  chr_num = c(1,1,1,1,1))

l <- list(df1, df2)
str(l)
# List of 2
#  $ :'data.frame': 5 obs. of  8 variables:
#   ..$ int_int: int [1:5] 0 0 0 0 0
#   ..$ num_num: num [1:5] 0 0 0 0 0
#   ..$ num_int: num [1:5] 0 0 0 0 0
#   ..$ num_psx: num [1:5] 0 0 0 0 0
#   ..$ num_dat: num [1:5] 0 0 0 0 0
#   ..$ chr_num: chr [1:5] "text" "text" "text" "text" ...
#   ..$ psx_dat: POSIXct[1:5], format: "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" ...
#   ..$ chr_mis: num [1:5] 0 0 0 0 0
#  $ :'data.frame': 5 obs. of  7 variables:
#   ..$ int_int: int [1:5] 1 1 1 1 1
#   ..$ num_num: num [1:5] 1 1 1 1 1
#   ..$ num_int: int [1:5] 1 1 1 1 1
#   ..$ num_psx: POSIXct[1:5], format: "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" ...
#   ..$ num_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
#   ..$ psx_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
#   ..$ chr_num: num [1:5] 1 1 1 1 1

(The column names indicate the different types/classes.) Note the problems with those frames:

  • "chr_mis" is missing in one;
  • columns are in a different order, "chr_num" notably; and obviously
  • the classes are rarely the same :-)

I expect from this that:

  • if any column is a string, all frames have that column as a string;
  • if a column is POSIXt or Date (special-case numeric due to attributes), all should be;
  • num and int go to num, normal R behavior;
  • missing columns are added, using the appropriate R NA class (there are at least six types of NA)

And the fixed data:

str(l2 <- normalize_attributes(l))
# List of 2
#  $ :'data.frame': 5 obs. of  8 variables:
#   ..$ int_int: int [1:5] 0 0 0 0 0
#   ..$ num_num: num [1:5] 0 0 0 0 0
#   ..$ num_int: num [1:5] 0 0 0 0 0
#   ..$ num_psx: POSIXct[1:5], format: "1969-12-31 19:00:00" "1969-12-31 19:00:00" "1969-12-31 19:00:00" "1969-12-31 19:00:00" ...
#   ..$ num_dat: Date[1:5], format: "1970-01-01" "1970-01-01" "1970-01-01" "1970-01-01" ...
#   ..$ chr_num: chr [1:5] "text" "text" "text" "text" ...
#   ..$ psx_dat: POSIXct[1:5], format: "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" ...
#   ..$ chr_mis: num [1:5] 0 0 0 0 0
#  $ :'data.frame': 5 obs. of  8 variables:
#   ..$ int_int: int [1:5] 1 1 1 1 1
#   ..$ num_num: num [1:5] 1 1 1 1 1
#   ..$ num_int: num [1:5] 1 1 1 1 1
#   ..$ num_psx: POSIXct[1:5], format: "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" ...
#   ..$ num_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
#   ..$ psx_dat: POSIXct[1:5], format: "1970-01-01 00:11:09" "1970-01-01 00:11:09" "1970-01-01 00:11:09" "1970-01-01 00:11:09" ...
#   ..$ chr_num: chr [1:5] "1" "1" "1" "1" ...
#   ..$ chr_mis: num [1:5] NA NA NA NA NA

which can now be rbinded more safely:

do.call(rbind, l2)
#    int_int num_num num_int             num_psx    num_dat chr_num             psx_dat chr_mis
# 1        0       0       0 1969-12-31 19:00:00 1970-01-01    text 2021-02-11 11:43:31       0
# 2        0       0       0 1969-12-31 19:00:00 1970-01-01    text 2021-02-11 11:43:31       0
# 3        0       0       0 1969-12-31 19:00:00 1970-01-01    text 2021-02-11 11:43:31       0
# 4        0       0       0 1969-12-31 19:00:00 1970-01-01    text 2021-02-11 11:43:31       0
# 5        0       0       0 1969-12-31 19:00:00 1970-01-01    text 2021-02-11 11:43:31       0
# 6        1       1       1 2021-02-11 11:43:31 2021-02-11       1 1970-01-01 00:11:09      NA
# 7        1       1       1 2021-02-11 11:43:31 2021-02-11       1 1970-01-01 00:11:09      NA
# 8        1       1       1 2021-02-11 11:43:31 2021-02-11       1 1970-01-01 00:11:09      NA
# 9        1       1       1 2021-02-11 11:43:31 2021-02-11       1 1970-01-01 00:11:09      NA
# 10       1       1       1 2021-02-11 11:43:31 2021-02-11       1 1970-01-01 00:11:09      NA

Upvotes: 0

www
www

Reputation: 39154

Change everything columns to character, and then you can combine the data frames.

library(dplyr)
library(purrr)

df_all <- map_dfr(l, ~.x %>% mutate(across(everything(), as.character)))

A better way could be that when you read the data into R, make sure all columns are character.

Upvotes: 1

Related Questions