Reputation: 5201
Suppose we have a folder containing multiple data.csv files, each containing the same number of variables but each from different times. Is there a way in R to import them all simultaneously rather than having to import them all individually?
My problem is that I have around 2000 data files to import and having to import them individually just by using the code:
read.delim(file="filename", header=TRUE, sep="\t")
is not very efficient.
Upvotes: 333
Views: 543648
Reputation: 4340
With many files and many cores, fread xargs cat
(described below) is about 50x faster than the fastest solution in the top 3 answers.
rbindlist lapply read.delim 500s <- 1st place & accepted answer
rbindlist lapply fread 250s <- 2nd & 3rd place answers
rbindlist mclapply fread 10s
fread xargs cat 5s
Time to read 121401 csvs into a single data.table. Each time is an average of three runs then rounded. Each csv has 3 columns, one header row, and, on average, 4.510 rows. Machine is a GCP VM with 96 cores.
The top three answers by @A5C1D2H2I1M1N2O1R2T1, @leerssej, and @marbel and are all essentially the same: apply fread (or read.delim) to each file, then rbind/rbindlist the resulting data.tables. For small datasets, I usually use the rbindlist(lapply(list.files("*.csv"),fread))
form. For medium-sized datasets, I use parallel's mclapply instead of lapply, which is much faster if you have many cores.
This is better than other R-internal alternatives, but not the best for a large number of small csvs when speed matters. In that case, it can be much faster to first use cat
to concatenate all the csvs into one csv, as in @Spacedman's answer. I'll add some detail on how to do this from within R:
x = fread(cmd='cat *.csv', header=F)
However, what if each csv has a header?
x = fread(cmd="awk 'NR==1||FNR!=1' *.csv", header=T)
And what if you have so many files that the *.csv
shell glob fails?
x = fread(cmd='find . -name "*.csv" | xargs cat', header=F)
And what if all files have a header AND there are too many files?
header = fread(cmd='find . -name "*.csv" | head -n1 | xargs head -n1', header=T)
x = fread(cmd='find . -name "*.csv" | xargs tail -q -n+2', header=F)
setnames(x,header)
And what if the resulting concatenated csv is too big for system memory? (e.g., /dev/shm out of space error)
system('find . -name "*.csv" | xargs cat > combined.csv')
x = fread('combined.csv', header=F)
With headers?
system('find . -name "*.csv" | head -n1 | xargs head -n1 > combined.csv')
system('find . -name "*.csv" | xargs tail -q -n+2 >> combined.csv')
x = fread('combined.csv', header=T)
Finally, what if you don't want all .csv in a directory, but rather a specific set of files? (Also, they all have headers.) (This is my use case.)
fread(text=paste0(system("xargs cat|awk 'NR==1||$1!=\"<column one name>\"'",input=paths,intern=T),collapse="\n"),header=T,sep="\t")
and this is about the same speed as plain fread xargs cat :)
Note: for data.table pre-v1.11.6 (19 Sep 2018), omit the cmd=
from fread(cmd=
.
To sum up, if you're interested in speed, and have many files and many cores, fread xargs cat is about 50x faster than the fastest solution in the top 3 answers.
Update: here is a function I wrote to easily apply the fastest solution. I use it in production in several situations, but you should test it thoroughly with your own data before trusting it.
fread_many = function(files,header=T,...){
if(length(files)==0) return()
if(typeof(files)!='character') return()
files = files[file.exists(files)]
if(length(files)==0) return()
tmp = tempfile(fileext = ".csv")
# note 1: requires awk, not cat or tail because some files have no final newline
# note 2: parallel --xargs is 40% slower
# note 3: reading to var is 15% slower and crashes R if the string is too long
# note 4: shorter paths -> more paths per awk -> fewer awks -> measurably faster
# so best cd to the csv dir and use relative paths
if(header==T){
system(paste0('head -n1 ',files[1],' > ',tmp))
system(paste0("xargs awk 'FNR>1' >> ",tmp),input=files)
} else {
system(paste0("xargs awk '1' > ",tmp),input=files)
}
DT = fread(file=tmp,header=header,...)
file.remove(tmp)
DT
}
Update 2: here is a more complicated version of the fread_many function for cases where you want the resulting data.table to include a column for the inpath of each csv. In this case, one must also explicitly specify the csv separator with the sep argument.
fread_many = function(files,header=T,keep_inpath=F,sep="auto",...){
if(length(files)==0) return()
if(typeof(files)!='character') return()
files = files[file.exists(files)]
if(length(files)==0) return()
tmp = tempfile(fileext = ".csv")
if(keep_inpath==T){
stopifnot(sep!="auto")
if(header==T){
system(paste0('/usr/bin/echo -ne inpath"',sep,'" > ',tmp))
system(paste0('head -n1 ',files[1],' >> ',tmp))
system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{if(FNR>1)print FILENAME,$0}' >> ",tmp),input=files)
} else {
system(paste0("xargs awk -vsep='",sep,"' 'BEGIN{OFS=sep}{print FILENAME,$0}' > ",tmp),input=files)
}
} else {
if(header==T){
system(paste0('head -n1 ',files[1],' > ',tmp))
system(paste0("xargs awk 'FNR>1' >> ",tmp),input=files)
} else {
system(paste0("xargs awk '1' > ",tmp),input=files)
}
}
DT = fread(file=tmp,header=header,sep=sep,...)
file.remove(tmp)
DT
}
Caveat: all of my solutions that concatenate the csvs before reading them assumes they all have the same separator. If not all of your csvs use the same separator, instead use rbindlist lapply fread, rbindlist mclapply fread, or fread xargs cat in batches, where all csvs in a batch use the same separator.
Upvotes: 42
Reputation: 14958
A speedy and succinct tidyverse
solution:
(more than twice as fast as Base R's read.csv
)
tbl <-
list.files(pattern = "\\.csv$") %>%
map_df(~read_csv(.))
and data.table's fread()
can even cut those load times by half again. (for 1/4 the Base R times)
library(data.table)
tbl_fread <-
list.files(pattern = "\\.csv$") %>%
map_df(~fread(.))
The stringsAsFactors = FALSE
argument keeps the dataframe factor free, (and as marbel points out, is the default setting for fread
)
If the typecasting is being cheeky, you can force all the columns to be as characters with the col_types
argument.
tbl <-
list.files(pattern = "\\.csv$") %>%
map_df(~read_csv(., col_types = cols(.default = "c")))
If you are wanting to dip into subdirectories to construct your list of files to eventually bind, then be sure to include the path name, as well as register the files with their full names in your list. This will allow the binding work to go on outside of the current directory. (Thinking of the full pathnames as operating like passports to allow movement back across directory 'borders'.)
tbl <-
list.files(path = "./subdirectory/",
pattern = "\\.csv$",
full.names = T) %>%
map_df(~read_csv(., col_types = cols(.default = "c")))
As Hadley describes here (about halfway down):
map_df(x, f)
is effectively the same asdo.call("rbind", lapply(x, f))
....
Bonus Feature - adding filenames to the records per Niks feature request in comments below:
filename
to each record.Code explained: make a function to append the filename to each record during the initial reading of the tables. Then use that function instead of the simple read_csv()
function.
read_plus <- function(flnm) {
read_csv(flnm) %>%
mutate(filename = flnm)
}
tbl_with_sources <-
list.files(pattern = "\\.csv$",
full.names = T) %>%
map_df(~read_plus(.))
(The typecasting and subdirectory handling approaches can also be handled inside the read_plus()
function in the same manner as illustrated in the second and third variants suggested above.)
### Benchmark Code & Results
library(tidyverse)
library(data.table)
library(microbenchmark)
### Base R Approaches
#### Instead of a dataframe, this approach creates a list of lists
#### removed from analysis as this alone doubled analysis time reqd
# lapply_read.delim <- function(path, pattern = "\\.csv$") {
# temp = list.files(path, pattern, full.names = TRUE)
# myfiles = lapply(temp, read.delim)
# }
#### `read.csv()`
do.call_rbind_read.csv <- function(path, pattern = "\\.csv$") {
files = list.files(path, pattern, full.names = TRUE)
do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
}
map_df_read.csv <- function(path, pattern = "\\.csv$") {
list.files(path, pattern, full.names = TRUE) %>%
map_df(~read.csv(., stringsAsFactors = FALSE))
}
### *dplyr()*
#### `read_csv()`
lapply_read_csv_bind_rows <- function(path, pattern = "\\.csv$") {
files = list.files(path, pattern, full.names = TRUE)
lapply(files, read_csv) %>% bind_rows()
}
map_df_read_csv <- function(path, pattern = "\\.csv$") {
list.files(path, pattern, full.names = TRUE) %>%
map_df(~read_csv(., col_types = cols(.default = "c")))
}
### *data.table* / *purrr* hybrid
map_df_fread <- function(path, pattern = "\\.csv$") {
list.files(path, pattern, full.names = TRUE) %>%
map_df(~fread(.))
}
### *data.table*
rbindlist_fread <- function(path, pattern = "\\.csv$") {
files = list.files(path, pattern, full.names = TRUE)
rbindlist(lapply(files, function(x) fread(x)))
}
do.call_rbind_fread <- function(path, pattern = "\\.csv$") {
files = list.files(path, pattern, full.names = TRUE)
do.call(rbind, lapply(files, function(x) fread(x, stringsAsFactors = FALSE)))
}
read_results <- function(dir_size){
microbenchmark(
# lapply_read.delim = lapply_read.delim(dir_size), # too slow to include in benchmarks
do.call_rbind_read.csv = do.call_rbind_read.csv(dir_size),
map_df_read.csv = map_df_read.csv(dir_size),
lapply_read_csv_bind_rows = lapply_read_csv_bind_rows(dir_size),
map_df_read_csv = map_df_read_csv(dir_size),
rbindlist_fread = rbindlist_fread(dir_size),
do.call_rbind_fread = do.call_rbind_fread(dir_size),
map_df_fread = map_df_fread(dir_size),
times = 10L)
}
read_results_lrg_mid_mid <- read_results('./testFolder/500MB_12.5MB_40files')
print(read_results_lrg_mid_mid, digits = 3)
read_results_sml_mic_mny <- read_results('./testFolder/5MB_5KB_1000files/')
read_results_sml_tny_mod <- read_results('./testFolder/5MB_50KB_100files/')
read_results_sml_sml_few <- read_results('./testFolder/5MB_500KB_10files/')
read_results_med_sml_mny <- read_results('./testFolder/50MB_5OKB_1000files')
read_results_med_sml_mod <- read_results('./testFolder/50MB_5OOKB_100files')
read_results_med_med_few <- read_results('./testFolder/50MB_5MB_10files')
read_results_lrg_sml_mny <- read_results('./testFolder/500MB_500KB_1000files')
read_results_lrg_med_mod <- read_results('./testFolder/500MB_5MB_100files')
read_results_lrg_lrg_few <- read_results('./testFolder/500MB_50MB_10files')
read_results_xlg_lrg_mod <- read_results('./testFolder/5000MB_50MB_100files')
print(read_results_sml_mic_mny, digits = 3)
print(read_results_sml_tny_mod, digits = 3)
print(read_results_sml_sml_few, digits = 3)
print(read_results_med_sml_mny, digits = 3)
print(read_results_med_sml_mod, digits = 3)
print(read_results_med_med_few, digits = 3)
print(read_results_lrg_sml_mny, digits = 3)
print(read_results_lrg_med_mod, digits = 3)
print(read_results_lrg_lrg_few, digits = 3)
print(read_results_xlg_lrg_mod, digits = 3)
# display boxplot of my typical use case results & basic machine max load
par(oma = c(0,0,0,0)) # remove overall margins if present
par(mfcol = c(1,1)) # remove grid if present
par(mar = c(12,5,1,1) + 0.1) # to display just a single boxplot with its complete labels
boxplot(read_results_lrg_mid_mid, las = 2, xlab = "", ylab = "Duration (seconds)", main = "40 files @ 12.5MB (500MB)")
boxplot(read_results_xlg_lrg_mod, las = 2, xlab = "", ylab = "Duration (seconds)", main = "100 files @ 50MB (5GB)")
# generate 3x3 grid boxplots
par(oma = c(12,1,1,1)) # margins for the whole 3 x 3 grid plot
par(mfcol = c(3,3)) # create grid (filling down each column)
par(mar = c(1,4,2,1)) # margins for the individual plots in 3 x 3 grid
boxplot(read_results_sml_mic_mny, las = 2, xlab = "", ylab = "Duration (seconds)", main = "1000 files @ 5KB (5MB)", xaxt = 'n')
boxplot(read_results_sml_tny_mod, las = 2, xlab = "", ylab = "Duration (milliseconds)", main = "100 files @ 50KB (5MB)", xaxt = 'n')
boxplot(read_results_sml_sml_few, las = 2, xlab = "", ylab = "Duration (milliseconds)", main = "10 files @ 500KB (5MB)",)
boxplot(read_results_med_sml_mny, las = 2, xlab = "", ylab = "Duration (microseconds) ", main = "1000 files @ 50KB (50MB)", xaxt = 'n')
boxplot(read_results_med_sml_mod, las = 2, xlab = "", ylab = "Duration (microseconds)", main = "100 files @ 500KB (50MB)", xaxt = 'n')
boxplot(read_results_med_med_few, las = 2, xlab = "", ylab = "Duration (seconds)", main = "10 files @ 5MB (50MB)")
boxplot(read_results_lrg_sml_mny, las = 2, xlab = "", ylab = "Duration (seconds)", main = "1000 files @ 500KB (500MB)", xaxt = 'n')
boxplot(read_results_lrg_med_mod, las = 2, xlab = "", ylab = "Duration (seconds)", main = "100 files @ 5MB (500MB)", xaxt = 'n')
boxplot(read_results_lrg_lrg_few, las = 2, xlab = "", ylab = "Duration (seconds)", main = "10 files @ 50MB (500MB)")
Rows: file counts (1000, 100, 10)
Columns: final dataframe size (5MB, 50MB, 500MB)
(click on image to view original size)
The base R results are better for the smallest use cases where the overhead of bringing the C libraries of purrr and dplyr to bear outweigh the performance gains that are observed when performing larger scale processing tasks.
if you want to run your own tests you may find this bash script helpful.
for ((i=1; i<=$2; i++)); do
cp "$1" "${1:0:8}_${i}.csv";
done
bash what_you_name_this_script.sh "fileName_you_want_copied" 100
will create 100 copies of your file sequentially numbered (after the initial 8 characters of the filename and an underscore).
With special thanks to:
map_df()
here.fread()
. (I need to study up on data.table
.)Upvotes: 256
Reputation: 193497
Something like the following should result in each data frame as a separate element in a single list:
temp = list.files(pattern="\\.csv$")
myfiles = lapply(temp, read.delim)
This assumes that you have those CSVs in a single directory--your current working directory--and that all of them have the lower-case extension .csv
.
If you then want to combine those data frames into a single data frame, see the solutions in other answers using things like do.call(rbind,...)
, dplyr::bind_rows()
or data.table::rbindlist()
.
If you really want each data frame in a separate object, even though that's often inadvisable, you could do the following with assign
:
temp = list.files(pattern="\\.csv$")
for (i in 1:length(temp)) assign(temp[i], read.csv(temp[i]))
Or, without assign
, and to demonstrate (1) how the file name can be cleaned up and (2) show how to use list2env
, you can try the following:
temp = list.files(pattern="\\.csv$")
list2env(
lapply(setNames(temp, make.names(gsub("\\.csv$", "", temp))),
read.csv), envir = .GlobalEnv)
But again, it's often better to leave them in a single list.
Upvotes: 385
Reputation: 7984
An update to @leerssej's "speedy and succinct" tidyverse solution, since map_df
has been superseded:
tbl <-
list.files(pattern = "*.csv") |>
map((\(fn) read_csv(fn)) |>
list_rbind()
Upvotes: 2
Reputation: 133
With readr 2.0.0 onwards, you can read multiple files in at once simply by providing a list of their paths to the file
argument. Here is an example showing this with readr::read_csv()
.
packageVersion("readr")
#> [1] '2.0.1'
library(readr)
library(fs)
# create files to read in
write_csv(read_csv("1, 2 \n 3, 4", col_names = c("x", "y")), file = "file1.csv")
write_csv(read_csv("5, 6 \n 7, 8", col_names = c("x", "y")), file = "file2.csv")
# get a list of files
files <- dir_ls(".", glob = "file*csv")
files
#> file1.csv file2.csv
# read them in at once
# record paths in a column called filename
read_csv(files, id = "filename")
#> # A tibble: 4 × 3
#> filename x y
#> <chr> <dbl> <dbl>
#> 1 file1.csv 1 2
#> 2 file1.csv 3 4
#> 3 file2.csv 5 6
#> 4 file2.csv 7 8
Created on 2021-09-16 by the reprex package (v2.0.1)
Upvotes: 3
Reputation:
In my view, most of the other answers are obsoleted by rio::import_list
, which is a succinct one-liner:
library(rio)
my_data <- import_list(dir("path_to_directory", pattern = ".csv"), rbind = TRUE)
Any extra arguments are passed to rio::import
. rio
can deal with almost any file format R can read, and it uses data.table
's fread
where possible, so it should be fast too.
Upvotes: 18
Reputation: 4168
Using purrr
and including file IDs as a column:
library(tidyverse)
p <- "my/directory"
files <- list.files(p, pattern="csv", full.names=TRUE) %>%
set_names()
merged <- files %>% map_dfr(read_csv, .id="filename")
Without set_names()
, .id=
will use integer indicators, instead of actual file names.
If you then want just the short filename without the full path:
merged <- merged %>% mutate(filename=basename(filename))
Upvotes: 6
Reputation: 119
The following codes should give you the fastest speed for big data as long as you have many cores on your computer:
if (!require("pacman")) install.packages("pacman")
pacman::p_load(doParallel, data.table, stringr)
# get the file name
dir() %>% str_subset("\\.csv$") -> fn
# use parallel setting
(cl <- detectCores() %>%
makeCluster()) %>%
registerDoParallel()
# read and bind all files together
system.time({
big_df <- foreach(
i = fn,
.packages = "data.table"
) %dopar%
{
fread(i, colClasses = "character")
} %>%
rbindlist(fill = TRUE)
})
# end of parallel work
stopImplicitCluster(cl)
Updated in 2020/04/16: As I find a new package available for parallel computation, an alternative solution is provided using the following codes.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(future.apply, data.table, stringr)
# get the file name
dir() %>% str_subset("\\.csv$") -> fn
plan(multiprocess)
future_lapply(fn,fread,colClasses = "character") %>%
rbindlist(fill = TRUE) -> res
# res is the merged data.table
Upvotes: 2
Reputation: 9687
It was requested that I add this functionality to the stackoverflow R package. Given that it is a tinyverse package (and can't depend on third party packages), here is what I came up with:
#' Bulk import data files
#'
#' Read in each file at a path and then unnest them. Defaults to csv format.
#'
#' @param path a character vector of full path names
#' @param pattern an optional \link[=regex]{regular expression}. Only file names which match the regular expression will be returned.
#' @param reader a function that can read data from a file name.
#' @param ... optional arguments to pass to the reader function (eg \code{stringsAsFactors}).
#' @param reducer a function to unnest the individual data files. Use I to retain the nested structure.
#' @param recursive logical. Should the listing recurse into directories?
#'
#' @author Neal Fultz
#' @references \url{https://stackoverflow.com/questions/11433432/how-to-import-multiple-csv-files-at-once}
#'
#' @importFrom utils read.csv
#' @export
read.directory <- function(path='.', pattern=NULL, reader=read.csv, ...,
reducer=function(dfs) do.call(rbind.data.frame, dfs), recursive=FALSE) {
files <- list.files(path, pattern, full.names = TRUE, recursive = recursive)
reducer(lapply(files, reader, ...))
}
By parameterizing the reader and reducer function, people can use data.table or dplyr if they so choose, or just use the base R functions that are fine for smaller data sets.
Upvotes: 2
Reputation: 141
This is my specific example to read multiple files and combine them into 1 data frame:
path<- file.path("C:/folder/subfolder")
files <- list.files(path=path, pattern="/*.csv",full.names = T)
library(data.table)
data = do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
Upvotes: 4
Reputation: 7714
Here are some options to convert the .csv files into one data.frame using R base and some of the available packages for reading files in R.
This is slower than the options below.
# Get the files names
files = list.files(pattern="*.csv")
# First apply read.csv, then rbind
myfiles = do.call(rbind, lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE)))
Edit: - A few more extra choices using data.table
and readr
A fread()
version, which is a function of the data.table
package. This is by far the fastest option in R.
library(data.table)
DT = do.call(rbind, lapply(files, fread))
# The same using `rbindlist`
DT = rbindlist(lapply(files, fread))
Using readr, which is another package for reading csv files. It's slower than fread
, faster than base R but has different functionalities.
library(readr)
library(dplyr)
tbl = lapply(files, read_csv) %>% bind_rows()
Upvotes: 122
Reputation: 119
I like the approach using list.files()
, lapply()
and list2env()
(or fs::dir_ls()
, purrr::map()
and list2env()
). That seems simple and flexible.
Alternatively, you may try the small package {tor} (to-R): By default it imports files from the working directory into a list (list_*()
variants) or into the global environment (load_*()
variants).
For example, here I read all the .csv files from my working directory into a list using tor::list_csv()
:
library(tor)
dir()
#> [1] "_pkgdown.yml" "cran-comments.md" "csv1.csv"
#> [4] "csv2.csv" "datasets" "DESCRIPTION"
#> [7] "docs" "inst" "LICENSE.md"
#> [10] "man" "NAMESPACE" "NEWS.md"
#> [13] "R" "README.md" "README.Rmd"
#> [16] "tests" "tmp.R" "tor.Rproj"
list_csv()
#> $csv1
#> x
#> 1 1
#> 2 2
#>
#> $csv2
#> y
#> 1 a
#> 2 b
And now I load those files into my global environment with tor::load_csv()
:
# The working directory contains .csv files
dir()
#> [1] "_pkgdown.yml" "cran-comments.md" "CRAN-RELEASE"
#> [4] "csv1.csv" "csv2.csv" "datasets"
#> [7] "DESCRIPTION" "docs" "inst"
#> [10] "LICENSE.md" "man" "NAMESPACE"
#> [13] "NEWS.md" "R" "README.md"
#> [16] "README.Rmd" "tests" "tmp.R"
#> [19] "tor.Rproj"
load_csv()
# Each file is now available as a dataframe in the global environment
csv1
#> x
#> 1 1
#> 2 2
csv2
#> y
#> 1 a
#> 2 b
Should you need to read specific files, you can match their file-path with regexp
, ignore.case
and invert
.
For even more flexibility use list_any()
. It allows you to supply the reader function via the argument .f
.
(path_csv <- tor_example("csv"))
#> [1] "C:/Users/LeporeM/Documents/R/R-3.5.2/library/tor/extdata/csv"
dir(path_csv)
#> [1] "file1.csv" "file2.csv"
list_any(path_csv, read.csv)
#> $file1
#> x
#> 1 1
#> 2 2
#>
#> $file2
#> y
#> 1 a
#> 2 b
Pass additional arguments via ... or inside the lambda function.
path_csv %>%
list_any(readr::read_csv, skip = 1)
#> Parsed with column specification:
#> cols(
#> `1` = col_double()
#> )
#> Parsed with column specification:
#> cols(
#> a = col_character()
#> )
#> $file1
#> # A tibble: 1 x 1
#> `1`
#> <dbl>
#> 1 2
#>
#> $file2
#> # A tibble: 1 x 1
#> a
#> <chr>
#> 1 b
path_csv %>%
list_any(~read.csv(., stringsAsFactors = FALSE)) %>%
map(as_tibble)
#> $file1
#> # A tibble: 2 x 1
#> x
#> <int>
#> 1 1
#> 2 2
#>
#> $file2
#> # A tibble: 2 x 1
#> y
#> <chr>
#> 1 a
#> 2 b
Upvotes: 1
Reputation: 4367
Using plyr::ldply
there is roughly a 50% speed increase by enabling the .parallel
option while reading 400 csv files roughly 30-40 MB each. Example includes a text progress bar.
library(plyr)
library(data.table)
library(doSNOW)
csv.list <- list.files(path="t:/data", pattern=".csv$", full.names=TRUE)
cl <- makeCluster(4)
registerDoSNOW(cl)
pb <- txtProgressBar(max=length(csv.list), style=3)
pbu <- function(i) setTxtProgressBar(pb, i)
dt <- setDT(ldply(csv.list, fread, .parallel=TRUE, .paropts=list(.options.snow=list(progress=pbu))))
stopCluster(cl)
Upvotes: 7
Reputation: 77
Building on dnlbrk's comment, assign can be considerably faster than list2env for big files.
library(readr)
library(stringr)
List_of_file_paths <- list.files(path ="C:/Users/Anon/Documents/Folder_with_csv_files/", pattern = ".csv", all.files = TRUE, full.names = TRUE)
By setting the full.names argument to true, you will get the full path to each file as a separate character string in your list of files, e.g., List_of_file_paths[1] will be something like "C:/Users/Anon/Documents/Folder_with_csv_files/file1.csv"
for(f in 1:length(List_of_filepaths)) {
file_name <- str_sub(string = List_of_filepaths[f], start = 46, end = -5)
file_df <- read_csv(List_of_filepaths[f])
assign( x = file_name, value = file_df, envir = .GlobalEnv)
}
You could use the data.table package's fread or base R read.csv instead of read_csv. The file_name step allows you to tidy up the name so that each data frame does not remain with the full path to the file as it's name. You could extend your loop to do further things to the data table before transferring it to the global environment, for example:
for(f in 1:length(List_of_filepaths)) {
file_name <- str_sub(string = List_of_filepaths[f], start = 46, end = -5)
file_df <- read_csv(List_of_filepaths[f])
file_df <- file_df[,1:3] #if you only need the first three columns
assign( x = file_name, value = file_df, envir = .GlobalEnv)
}
Upvotes: 4
Reputation: 395
This is the code I developed to read all csv files into R. It will create a dataframe for each csv file individually and title that dataframe the file's original name (removing spaces and the .csv) I hope you find it useful!
path <- "C:/Users/cfees/My Box Files/Fitness/"
files <- list.files(path=path, pattern="*.csv")
for(file in files)
{
perpos <- which(strsplit(file, "")[[1]]==".")
assign(
gsub(" ","",substr(file, 1, perpos-1)),
read.csv(paste(path,file,sep="")))
}
Upvotes: 27
Reputation: 94162
As well as using lapply
or some other looping construct in R you could merge your CSV files into one file.
In Unix, if the files had no headers, then its as easy as:
cat *.csv > all.csv
or if there are headers, and you can find a string that matches headers and only headers (ie suppose header lines all start with "Age"), you'd do:
cat *.csv | grep -v ^Age > all.csv
I think in Windows you could do this with COPY
and SEARCH
(or FIND
or something) from the DOS command box, but why not install cygwin
and get the power of the Unix command shell?
Upvotes: 29