Reputation: 147
Converting a list of data frames: not a simple rbind, second row to new columns
I have a list
employeesList = list(data.frame(first = ("Al"), second = "Jones"),
data.frame(first = c("Al", "Barb"), second = c("Jones", "Smith")),
data.frame(first = c("Al", "Barb", "Carol"), second = c("Jones", "Smith", "Adams")),
data.frame(first = ("Al"), second = "Jones"))
I am looking to produce this.
employeesDF = data.frame(first = c("Al", "Al", "Al", "Al"), second = c("Jones", "Jones", "Jones", "Jones"),
first2 = c(NA, "Barb", "Barb", NA), second2 = c(NA, "Smith", "Smith", NA),
first3 = c(NA, NA, "Carol", NA), second3 = c(NA, NA, "Adams", NA))
I would like each data frame to be a row in the results data frame. Note that the first data frame when converted will have two columns, the second df when converted will have four columns, the third df when converted will produce 6 columns, the fourth df when converted will produce 2 columns and so on. I realize there will have to be a fill of NA values I have done some research and the problem is solved if the goal were simply to rbind. I do not see a way to solve my problem. Convert a list of data frames into one data frame
After reading this, Combine two data frames by rows (rbind) when they have different sets of columns I got a start with
res1 = cbind(t(employeesList[[1]][1]), t(employeesList[[1]][2]))
res2 = cbind(t(employeesList[[2]][1]), t(employeesList[[2]][2]))
res3 = cbind(t(employeesList[[3]][1]), t(employeesList[[3]][2]))
res4 = cbind(t(employeesList[[4]][1]), t(employeesList[[4]][2]))
Then
library(dplyr)
bind_rows(list(res1, res2, res3, res4))
But I may have a very large number of data frames -- res1, …, resn. The number of columns is unspecified in advance, but will likely be fewer than 10. My procedure does not name the columns and I think that is needed for bind_rows.
Upvotes: 3
Views: 227
Reputation: 147
I made a mistake when I submitted my example data. It was not general enough in two different ways. The column names can vary inconsistently and the data can vary much more than indicated. I then asked my question at r-help. There the question was answered in multiple ways. Below are the solutions created by others along with my timing study.
# input data (list of data frames and data frames may have multiple rows)
employees4List = list(data.frame(first1 = "Al", second1 =
"Jones"),
data.frame(first2 = c("Al2", "Barb"),
second2 = c("Jones", "Smith")),
data.frame(first3 = c("Al3", "Barbara",
"Carol"),
second3 = c("Jones", "Smith",
"Adams")),
data.frame(first4 = ("Al"), second4 =
"Jones2"))
employees4List
# intermediate step (list of data frames with each just one row)
df1 = data.frame(First1 = "Al", Second1 = "Jones",
First2 = NA, Second2 = NA,
First3 = NA, Second3 = NA,
First4 = NA, Second4 = NA)
df2 = data.frame(First1 = "Al2", Second1 = "Jones",
First2 = "Barb", Second2 = "Smith",
First3 = NA, Second3 = NA,
First4 = NA, Second4 = NA)
df3 = data.frame(First1 = "Al3", Second1 = "Jones",
First2 = "Barbara", Second2 = "Smith",
First3 = "Carol", Second3 = "Adams",
First4 = NA, Second4 = NA)
df4 = data.frame(First1 = "Al", Second1 = "Jones2",
First2 = NA, Second2 = NA,
First3 = NA, Second3 = NA,
First4 = NA, Second4 = NA)
listFinal = list(df1, df2, df3, df4)
listFinal
# Expected final step, except that all columns should be character
# Just one data frame
dplyr::bind_rows(listFinal)
sapply(dplyr::bind_rows(listFinal), class)
# Solution 1 using base R by Sarah Goslee
dfbycol <- function(x) {
x <- lapply(x, function(y)as.vector(t(as.matrix(y))))
x <- lapply(x, function(y){length(y) <- max(sapply(x, length)); y})
x <- do.call(rbind, x)
x <- data.frame(x, stringsAsFactors=FALSE)
colnames(x) <- paste0(c("first", "last"), rep(seq(1, ncol(x)/2), each=2))
x
}
dfbycol(listFinal)
##########
# Solution 2 by Jeff Newmiller (Base R)
myrename2 <- function( DF, m ) {
# if a pair of columns is not present, raise an error
stopifnot( 2 == length( DF ) )
n <- nrow( DF )
# use memory layout of elements of matrix
# t() automatically converts to matrix (nrow=2)
# matrix(,nrow=1) re-interprets the column-major output of t()
# as a single row matrix
result <- as.data.frame( matrix( t( DF ), nrow = 1 )
, stringsAsFactors = FALSE
)
if ( n < m ) {
result[ , seq( 2 * n + 1, 2 * m ) ] <- NA
}
setNames( result
, sprintf( "%s%d"
, c( "First", "Second" )
, rep( seq.int( m ), each = 2 )
)
)
}
m <- max( unlist( lapply( employees4List, nrow ) ) )
listFinal2 <- lapply( employees4List, myrename2, m = m )
listFinal2
result2 <- do.call( rbind, listFinal2 )
result2
##########
# Solution 3 by Jeff Newmiller (uses dplyr)
myrename3 <- function( DF ) {
# if a pair of columns is not present, raise an error
stopifnot( 2 == length( DF ) )
n <- nrow( DF )
# use memory layout of elements of matrix
# t() automatically converts to matrix (nrow=2)
# matrix(,nrow=1) re-interprets the column-major output of t()
# as a single row matrix
setNames( as.data.frame( matrix( t( DF ), nrow = 1 )
, stringsAsFactors = FALSE
)
, sprintf( "%s%d"
, c( "First", "Second" )
, rep( seq.int( n ), each = 2 )
)
)
}
listFinal3 <- lapply( employees4List, myrename3 )
listFinal3
result3 <- dplyr::bind_rows( listFinal3 )
result3
# Solution 4 by Jeff Newmiller (uses dplyr and tidyr)
library(dplyr)
library(tidyr)
myrename4 <- function( DF ) {
# if a pair of columns is not present, raise an error
stopifnot( 2 == length( DF ) )
names( DF ) <- c( "a", "b" )
m <- nrow( DF )
( DF
%>% mutate_all( as.character )
%>% mutate( rw = LETTERS[ seq.int( n() ) ] )
%>% gather( col, val, -rw )
%>% tidyr::unite( "labels", rw, col, sep="" )
%>% spread( labels, val )
%>% setNames( sprintf( "%s%d"
, c( "First", "Second" )
, rep( seq.int( m ), each = 2 )
)
)
)
}
listFinal4 <- lapply( employees4List, myrename3)
listFinal4
result4 <- dplyr::bind_rows(listFinal4)
result4
#####
# Timing
# Create a large dataset
firsts = c("Al", "Barb", "Carol")
seconds = c("Washington", "Adams", "Jefferson" )
numReplications = 10000
set.seed(2018)
# Create data frames
sim_list1 = replicate(n = numReplications,
expr = {data.frame(first = base::sample(x = firsts, size = 1, replace = TRUE),
second = base::sample(x = seconds, size = 1, replace = TRUE))},
simplify = F)
sim_list2 = replicate(n = numReplications,
expr = {data.frame(first = base::sample(x = firsts, size = 2, replace = TRUE),
second = base::sample(x = seconds, size = 2, replace = TRUE))},
simplify = F)
sim_list3 = replicate(n = numReplications,
expr = {data.frame(first = base::sample(x = firsts, size = 3, replace = TRUE),
second = base::sample(x = seconds, size = 3, replace = TRUE))},
simplify = F)
# Create list
employeesList = c(sim_list1, sim_list2, sim_list3)
# Method 1
system.time(res1 <- dfbycol(employeesList))
# > system.time(dfbycol(employeesList))
# user system elapsed
# 757.87 0.18 758.62
# res1
rm(res1)
#####
# Method 2
system.time(m <- max( unlist( lapply( employeesList, nrow ) ) ))
# user system elapsed
# 0.22 0.00 0.22
system.time(listFinal2 <- lapply( employeesList, myrename2, m = m ) )
listFinal2
# user system elapsed
# 16.16 0.01 16.18
system.time(result2 <- do.call( rbind, listFinal2 ) )
# result2
# user system elapsed
# 3.96 0.00 3.96
rm(listFinal2)
rm(result2)
#####
# Method 3
system.time(listFinal3 <- lapply( employeesList, myrename3))
# user system elapsed
# 7.33 0.00 7.33
listFinal3
system.time(result3 <- dplyr::bind_rows( listFinal3 ))
# user system elapsed
# 0.17 0.00 0.17
rm(listFinal3)
rm(result3)
#####
# Method 4
system.time(listFinal4 <- lapply( employeesList, myrename4) )
# user system elapsed
# 400.05 0.04 400.24
listFinal4
system.time(result4 <- dplyr::bind_rows( listFinal4 ) )
# user system elapsed
# 0.17 0.00 0.17
# result4
Upvotes: 0
Reputation: 886968
Here is one option with gather/spread
library(tidyverse)
employeesList %>%
map_df(~ .x %>%
mutate_all(as.character) %>% # convert columns to character class
mutate(n = row_number(), n = replace(n, n==1, "")),
.id = 'grp') %>%
group_by(grp) %>%
gather(key, val, first:second) %>% # gather to long format
arrange(grp, n) %>%
unite(keyn, key, n, sep="") %>% # unite columns to create new column
ungroup %>%
mutate(keyn = factor(keyn, levels = unique(keyn))) %>% # for column order
spread(keyn, val) %>% # spread to wide format
select(-grp)
# A tibble: 4 x 6
# first second first2 second2 first3 second3
# <chr> <chr> <chr> <chr> <chr> <chr>
#1 Al Jones NA NA NA NA
#2 Al Jones Barb Smith NA NA
#3 Al Jones Barb Smith Carol Adams
#4 Al Jones NA NA NA NA
Upvotes: 1
Reputation: 12703
May be this using data.table
library('data.table')
rbindlist( l = lapply( employeesList, function(x) {
dcast( data = melt( setDT( x ), measure.vars = c( 'first', 'second'))[, V1 := seq_along(value), by = variable][],
formula = " . ~ variable + V1")[, -1]
}),
fill = TRUE,
use.names = TRUE )
# first_1 second_1 first_2 second_2 first_3 second_3
# 1: Al Jones NA NA NA NA
# 2: Al Jones Barb Smith NA NA
# 3: Al Jones Barb Smith Carol Adams
# 4: Al Jones NA NA NA NA
Upvotes: 1
Reputation: 388817
We can use lapply
and convert the list into a single row data frame using and then use bind_rows
to bind multiple data frames together.
library(dplyr)
bind_rows(lapply(employeesList, function(x) rbind.data.frame(c(t(x)))))
# X.Al. X.Jones. X.Barb. X.Smith. X.Carol. X.Adams.
#1 Al Jones <NA> <NA> <NA> <NA>
#2 Al Jones Barb Smith <NA> <NA>
#3 Al Jones Barb Smith Carol Adams
#4 Al Jones <NA> <NA> <NA> <NA>
We can rename the columns later using setNames
according to our preference.
Upvotes: 1