Reputation: 193
I´m having multiple dataframes/tibbles inside a list, like this: (in the real dataset it´s more like 500 df with 180 columns and 30 rows each)
df1 <- data.frame(Col_1 = c(0,0,0,0,0),
Col_2 = c(1,1,1,1,1),
Col_3 = c("text", "text", "text", "text", "text"))
df2 <- data.frame(Col_1 = c(0,0,0,0,0),
Col_2 = c(1,1,1,1,1),
Col_3 = c(2,2,2,2,2))
l <- list(df1, df2)
The reason for this is, because I´m using readxl over multiple excel files. In principle those excel files/columns are the same, but some columns are imported as character or double. This is caused by user input.
In the end I want a big dataframe with all the df binded by bind_rows()
or another function.
By simply using dplyr::bind_rows(l)
there will be an error (Error: Can't combine `..1$Col_3` <character> and `..2$Col_3` <double>.
), because I´m having different class types. To solve this problem, I´m using this approach:
l <- lapply(l, function(df) dplyr::mutate_at(df, vars(matches("Col_3")), as.character))
and afterwards this:
df <- dplyr::bind_rows(l)
which results in my desired df using this simple example.
BUT if I want to use the lapply function in my "real" dataset, this error always occurs:
Error: Can't transform a data frame with duplicate names.
How could I enclose the problem? (I can´t share the dataset because of confidentiality reasons, but I couldn´t reproduce this error in the above mentioned example)
Is there maybe a better way/function to convert this list into one df? Maybe convert automatically all column types to character (for now this would work, but this is of course not a good choice in the long run)
Upvotes: 3
Views: 3556
Reputation: 518
library dplyr
library purr
I would first create a list of all the files to be combined
csv_files = list.files(path = (paste0(data_path,"folder_with_csvs/")), pattern = "csv$", full.names = TRUE)
Then choose from the following options (second is probably preferred). First option is to have every column as a character.
df_all <- map_dfr(.x = set_names(csv_files),
.f = ~ read_csv(.x, col_types = cols(.default = "c")))
Second option is have the default parsing/guessing for all columns except your exception(s).
df_all <- map_dfr(.x = set_names(csv_files_new),
.f = ~ read_csv(.x, col_types = cols(.default = "?",Col_3 = "c")))
Third option is it classify each column.
df_all <- map_dfr(.x = set_names(csv_files_new),
.f = ~ read_csv(.x, col_types = cols(Col_1 = "c", #char
Col_2 = "d", #double
Col_3 = "c"))) #char
see p. 34 of http://www.hiercourse.com/docs/Working_in_the_Tidyverse.pdf for more details
Upvotes: 1
Reputation: 160447
Up front, the best fix for this is to specify the column class when importing the data. By far, the function I propose below should never be used when you have any semblance of control over the import process. The utility of this function is in the extreme times you need some normalization of classes.
normalize_attributes <- function(L) {
all_nms <- unique(unlist(sapply(L, names)))
L <- lapply(L, function(dat) {
missing_nms <- setdiff(all_nms, names(dat))
if (length(missing_nms)) dat[missing_nms] <- NA
dat
})
first_nms <- names(L[[1]])
reattrib_funcs <- setNames(vector("list", length(all_nms)), all_nms)
LL_attr <- lapply(setNames(nm = all_nms),
function(nm) lapply(L, function(dat) attributes(dat[[nm]])))
LL_first <- lapply(setNames(nm = all_nms),
function(nm) lapply(L, function(dat) dat[[nm]][1]))
for (nm in all_nms) {
haschr <- sapply(LL_first[[nm]], inherits, "character")
hasnum <- sapply(LL_first[[nm]], inherits, "numeric")
haspsx <- sapply(LL_first[[nm]], inherits, "POSIXt")
hasdate <- sapply(LL_first[[nm]], inherits, "Date")
# use psx if any present otherwise date
hastime <- if (any(haspsx)) haspsx else hasdate
if (any(haschr)) {
# character wins all
reattrib_funcs[[nm]] <- as.character
} else if (any(hastime)) {
# need to use attributes here; this allows up-conversion without
# needing to specify the 'origin' (this might be a bug!)
att <- LL_attr[[nm]][[ which.max(hastime) ]]
reattrib_funcs[[nm]] <- substitute(function(vec) `attributes<-`(vec, att))
} else if (any(hasnum)) {
reattrib_funcs[[nm]] <- as.numeric
} else {
cls <- class(unlist(LL_first[[nm]]))
reattrib_funcs[[nm]] <- substitute(function(vec) `class<-`(vec, cls))
}
}
L <- lapply(L, function(dat) {
dat[all_nms] <- Map(function(func, vec) eval(func)(vec),
reattrib_funcs[all_nms], dat[all_nms])
dat
})
}
Some cruel data, showing various combinations of data classes/types:
df1 <- data.frame(int_int = c(0L,0L,0L,0L,0L),
num_num = c(0,0,0,0,0),
num_int = c(0,0,0,0,0),
num_psx = c(0,0,0,0,0),
num_dat = c(0,0,0,0,0),
chr_num = c("text", "text", "text", "text", "text"),
psx_dat = rep(Sys.time(),5),
chr_mis = c(0,0,0,0,0))
df2 <- data.frame(int_int = c(1L,1L,1L,1L,1L),
num_num = c(1,1,1,1,1),
num_int = c(1L,1L,1L,1L,1L),
num_psx = rep(Sys.time(),5),
num_dat = rep(Sys.Date(),5),
psx_dat = rep(Sys.Date(),5),
chr_num = c(1,1,1,1,1))
l <- list(df1, df2)
str(l)
# List of 2
# $ :'data.frame': 5 obs. of 8 variables:
# ..$ int_int: int [1:5] 0 0 0 0 0
# ..$ num_num: num [1:5] 0 0 0 0 0
# ..$ num_int: num [1:5] 0 0 0 0 0
# ..$ num_psx: num [1:5] 0 0 0 0 0
# ..$ num_dat: num [1:5] 0 0 0 0 0
# ..$ chr_num: chr [1:5] "text" "text" "text" "text" ...
# ..$ psx_dat: POSIXct[1:5], format: "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" ...
# ..$ chr_mis: num [1:5] 0 0 0 0 0
# $ :'data.frame': 5 obs. of 7 variables:
# ..$ int_int: int [1:5] 1 1 1 1 1
# ..$ num_num: num [1:5] 1 1 1 1 1
# ..$ num_int: int [1:5] 1 1 1 1 1
# ..$ num_psx: POSIXct[1:5], format: "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" "2021-02-11 10:51:52" ...
# ..$ num_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
# ..$ psx_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
# ..$ chr_num: num [1:5] 1 1 1 1 1
(The column names indicate the different types/classes.) Note the problems with those frames:
"chr_mis"
is missing in one;"chr_num"
notably; and obviouslyI expect from this that:
POSIXt
or Date
(special-case numeric
due to attributes), all should be;num
and int
go to num
, normal R behavior;NA
class (there are at least six types of NA
)And the fixed data:
str(l2 <- normalize_attributes(l))
# List of 2
# $ :'data.frame': 5 obs. of 8 variables:
# ..$ int_int: int [1:5] 0 0 0 0 0
# ..$ num_num: num [1:5] 0 0 0 0 0
# ..$ num_int: num [1:5] 0 0 0 0 0
# ..$ num_psx: POSIXct[1:5], format: "1969-12-31 19:00:00" "1969-12-31 19:00:00" "1969-12-31 19:00:00" "1969-12-31 19:00:00" ...
# ..$ num_dat: Date[1:5], format: "1970-01-01" "1970-01-01" "1970-01-01" "1970-01-01" ...
# ..$ chr_num: chr [1:5] "text" "text" "text" "text" ...
# ..$ psx_dat: POSIXct[1:5], format: "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" ...
# ..$ chr_mis: num [1:5] 0 0 0 0 0
# $ :'data.frame': 5 obs. of 8 variables:
# ..$ int_int: int [1:5] 1 1 1 1 1
# ..$ num_num: num [1:5] 1 1 1 1 1
# ..$ num_int: num [1:5] 1 1 1 1 1
# ..$ num_psx: POSIXct[1:5], format: "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" "2021-02-11 11:43:31" ...
# ..$ num_dat: Date[1:5], format: "2021-02-11" "2021-02-11" "2021-02-11" "2021-02-11" ...
# ..$ psx_dat: POSIXct[1:5], format: "1970-01-01 00:11:09" "1970-01-01 00:11:09" "1970-01-01 00:11:09" "1970-01-01 00:11:09" ...
# ..$ chr_num: chr [1:5] "1" "1" "1" "1" ...
# ..$ chr_mis: num [1:5] NA NA NA NA NA
which can now be rbind
ed more safely:
do.call(rbind, l2)
# int_int num_num num_int num_psx num_dat chr_num psx_dat chr_mis
# 1 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 2 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 3 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 4 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 5 0 0 0 1969-12-31 19:00:00 1970-01-01 text 2021-02-11 11:43:31 0
# 6 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 7 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 8 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 9 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
# 10 1 1 1 2021-02-11 11:43:31 2021-02-11 1 1970-01-01 00:11:09 NA
Upvotes: 0
Reputation: 39154
Change everything columns to character
, and then you can combine the data frames.
library(dplyr)
library(purrr)
df_all <- map_dfr(l, ~.x %>% mutate(across(everything(), as.character)))
A better way could be that when you read the data into R, make sure all columns are character
.
Upvotes: 1