Reputation: 67
I have a folder with a couple hundred .csv files that I'd like to import and merge.
Each file contains two columns of data, but the files have different numbers of rows, and the rows have different names. The columns don't have names (for this example, let's call them x and y).
How can I merge these all together? I'd like to just stick the y columns together, side by side, rather than matching on any criteria, so that the first rows of every file line up and missing entries are filled with NA. Column x should go away entirely, but the rows should stay in the order they were originally in from each csv.
Here's an example:
Data frame 112_c1.csv:
x y
1 -0.5604
3 -0.2301
4 1.5587
5 0.0705
6 0.1292
Data frame 112_c2.csv:
x y
2 -0.83476
3 -0.82764
8 1.32225
9 0.36363
13 0.9373
42 -1.5567
50 -0.12237
51 -0.4837
Data frame 113_c1.csv:
x y
5 1.5783
6 0.7736
9 0.28273
15 1.44565
23 0.999878
29 -0.223756
Desired result
112_c1.y 112_c2.y 113_c1.y
-0.5604 -0.83476 1.5783
-0.2301 -0.82764 0.7736
1.5587 1.32225 0.28273
0.0705 0.36363 1.44565
0.1292 0.9373 0.999878
NA -1.5567 -0.223756
NA -0.12237 NA
NA -0.4837 NA
I've tried a few things and looked through many other threads, but code like the following simply produces NAs for every column after the first:
library(plyr)  # for rbind.fill
df <- do.call(rbind.fill, lapply(list.files(pattern = "*.csv"), read.csv))
Plus, if I use rbind instead of rbind.fill, I get the error that names do not match previous names, and I'm unsure how to get around that matching requirement.
Upvotes: 0
Views: 1560
Reputation: 525
Here's a solution that reads all your csv files from a folder called "data" and merges the y columns into a single data frame, using each file name as its column header.
library(tidyverse)

# store csv file paths
data_path <- "data"                                 # path to the data
files <- dir(data_path, pattern = "*.csv")          # get file names
files <- paste(data_path, '/', files, sep = "")

# read csv files and combine into a single dataframe
compiled_data <- tibble::tibble(File = files) %>%   # one row per file path
  tidyr::extract(File, "name", "(?<=data/)(.*)(?=[.]csv)", remove = FALSE) %>% # pull the bare file name out of the path
  mutate(Data = lapply(File, readr::read_csv, col_names = F)) %>% # read each file into a list-column called Data
  tidyr::unnest(Data) %>%                           # expand the list-column into rows (columns X1, X2)
  select(-File) %>%                                 # drop the full file path
  na.omit() %>%                                     # drop any rows containing NA
  spread(name, X2) %>%                              # reshape long to wide: one y column per file
  select(-X1) %>%                                   # drop the x column
  mutate_all(funs(.[order(is.na(.))]))              # push NAs to the bottom of each column, keeping value order
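One wrinkle to be aware of: spread() keys the rows on the x values (X1), so the wide result keeps one row per distinct x across all files, and after the final NA-collapsing step there may be trailing rows that are entirely NA. A minimal sketch of one way to drop them, using plain base-R subsetting on the compiled_data built above:
# keep only rows that still contain at least one non-NA value
compiled_data <- compiled_data[rowSums(!is.na(compiled_data)) > 0, ]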
Upvotes: 1
Reputation: 6483
A suggested solution using a function that calculates summary statistics right when loading the data:
readCalc <- function(file_path) {
  df <- read.csv(file_path)
  # one row per column of the input file, with its summary statistics
  return(data.frame(file     = file_path,
                    column   = names(df),
                    averages = apply(df, 2, mean),
                    N        = apply(df, 2, length),
                    min      = apply(df, 2, min),
                    stringsAsFactors = FALSE, row.names = NULL))
}
df <- do.call(rbind, lapply(list.files(pattern = "*.csv"), readCalc))
If we need the first or last value we could use dplyr::first or dplyr::last. We might even want to store the whole vector in a list somewhere, but if we only need the summary stats we probably don't need to.
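For instance, a minimal sketch of that idea (readCalcFL is a hypothetical variant of readCalc above; assumes dplyr is installed):
readCalcFL <- function(file_path) {
  df <- read.csv(file_path)
  data.frame(file   = file_path,
             column = names(df),
             first  = sapply(df, dplyr::first),  # first value of each column
             last   = sapply(df, dplyr::last),   # last value of each column
             stringsAsFactors = FALSE, row.names = NULL)
}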
Upvotes: 1
Reputation: 663
Taken from here: cbind a dataframe with an empty dataframe - cbind.fill?
x <- c(1:6)
y <- c(1:3)
z <- c(1:10)
cbind.fill <- function(...){
  nm <- list(...)
  nm <- lapply(nm, as.matrix)            # coerce every input to a matrix
  n <- max(sapply(nm, nrow))             # the longest input decides the row count
  # pad each matrix with NA rows up to n, then bind side by side
  do.call(cbind, lapply(nm, function(x)
    rbind(x, matrix(, n - nrow(x), ncol(x)))))
}
df <- as.data.frame(cbind.fill(x,y,z))
colnames(df) <- c("112_c1.y", "112_c2.y", "113_c1.y")
112_c1.y 112_c2.y 113_c1.y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 NA 4
5 5 NA 5
6 6 NA 6
7 NA NA 7
8 NA NA 8
9 NA NA 9
10 NA NA 10
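To apply the same idea to the files from the question, a rough sketch (the names files and ys are illustrative; this assumes the csv files sit in the working directory and, as the question describes, have no header row, so the second column is y):
files <- list.files(pattern = "\\.csv$")
ys <- lapply(files, function(f) read.csv(f, header = FALSE)[[2]])  # keep only each file's y column
df <- as.data.frame(do.call(cbind.fill, ys))                       # pad shorter columns with NA, bind side by side
colnames(df) <- paste0(sub("\\.csv$", "", files), ".y")            # e.g. "112_c1.y"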
Upvotes: 0