socialresearcher
socialresearcher

Reputation: 67

Importing and combining multiple CSV files in R with differing numbers and names of rows

I have a folder with a couple hundred .csv files that I'd like to import and merge. Each file contains two columns of data, but there are different numbers of rows, and the rows have different names. The columns don't have names (For this, let's say they're named x and y).

How can I merge these all together? I'd like to just stick the x columns together, side-by-side, rather than matching on any criteria so that the first row is matched across all data sets and empty rows are given NA. I'd like column x to go away. Although, the rows should stay in the order they were originally in from the csv.

Here's an example:

Data frame 112_c1.csv:

x       y
1  -0.5604
3  -0.2301
4   1.5587
5   0.0705
6   0.1292


Dataframe 112_c2.csv:

x         y
2   -0.83476
3   -0.82764
8    1.32225
9    0.36363
13   0.9373
42  -1.5567
50  -0.12237
51  -0.4837

Dataframe 113_c1.csv:

x       y
5   1.5783
6   0.7736
9   0.28273
15  1.44565
23  0.999878
29 -0.223756
=

Desired result

112_c1.y   112_c2.y  113_c1.y
-0.5604   -0.83476   1.5783
-0.2301   -0.82764   0.7736
1.5587     1.32225   0.28273
0.0705     0.36363   1.44565
0.1292     0.9373    0.999878
NA        -1.5567    -0.223756
NA        -0.12237   -0.223756
NA        -0.12237   NA
NA        -0.4837    NA

I've tried a few things, and looked through many other threads. But code like the following simply produces NAs for any following columns:

df <- do.call(rbind.fill, lapply(list.files(pattern = "*.csv"), read.csv))

Plus, if I use rbind instead of rbind.fill I get the error that names do not match previous names and I'm unsure of how to circumvent this matching criteria.

Upvotes: 0

Views: 1560

Answers (3)

Caitlin
Caitlin

Reputation: 525

Here's a solution to read all your csv files from a folder called "data" and merge the y columns into a single dataframe. This assigns the file name as the column header.

library(tidyverse)

# store csv file paths
data_path <- "data"   # path to the data
files <- dir(data_path, pattern = "*.csv") # get file names
files <- paste(data_path, '/', files, sep="")

# read csv files and combine into a single dataframe 
compiled_data = tibble::tibble(File = files) %>% #create a tibble called compiled_data
  tidyr::extract(File, "name", "(?<=data/)(.*)(?=[.]csv)", remove = FALSE) %>% #extract the file names 
  mutate(Data = lapply(File, readr::read_csv, col_names = F)) %>% #create a column called Data that stores the file names 
  tidyr::unnest(Data) %>% #unnest the Data column into multiple columns 
  select(-File) %>% #remove the File column 
  na.omit() %>% #remove the NA rows 
  spread(name, X2) %>% #reshape the dataframe from long to wide 
  select(-X1) %>% #remove the x column 
  mutate_all(funs(.[order(is.na(.))])) #reorganize dataframe to collapse the NA rows 

Upvotes: 1

dario
dario

Reputation: 6483

Suggested solution using a function to calculate summary statistics right when loading data:

 readCalc <- function(file_path) {
   df <- read.csv(file_path)
   return(data.frame(file=file_path,
                     column = names(df),
                     averages = apply(df, 2, mean),
                     N = apply(df, 2, length),
                     min = apply(df, 2, min),
                     stringsAsFactors = FALSE, row.names = NULL))
 }


 df <- do.call(rbind, lapply(list.files(pattern = "*.csv"), readCalc))

If we need the first or last value we could use dplyr::first, dplyr::last. We might even want to store the whole vector in a list somewhere, but if we only need the summary stats we might not even need it.

Upvotes: 1

ecology
ecology

Reputation: 663

Taken from here: cbind a dataframe with an empty dataframe - cbind.fill?

x <- c(1:6)
y <- c(1:3)
z <- c(1:10)

cbind.fill <- function(...){
  nm <- list(...) 
  nm <- lapply(nm, as.matrix)
  n <- max(sapply(nm, nrow)) 
  do.call(cbind, lapply(nm, function (x) 
    rbind(x, matrix(, n-nrow(x), ncol(x))))) 
}

df <- as.data.frame(cbind.fill(x,y,z))

colnames(df) <- c("112_c1.y", "112_c2.y", "113_c1.y")

   112_c1.y 112_c2.y 113_c1.y
1         1        1        1
2         2        2        2
3         3        3        3
4         4       NA        4
5         5       NA        5
6         6       NA        6
7        NA       NA        7
8        NA       NA        8
9        NA       NA        9
10       NA       NA       10

Upvotes: 0

Related Questions