Reputation: 279
I have 70 .csv files at 10GB each, they are all split files from the original file, hence each file after the first has for column names the next row from the last of the first file.
I want to read multiple csv files while assigning the column names of the first file to the next that are read. I have tried vroom
and readr
however they give the wrong column length, data.table::fread
seems to be the only one working however it does not allow the reading of multiple files at once unless within a looping function.
Here is what I have tried for reading multiple files:
lapply(files[1:2] ,fread( select=c("family", "species", "occurrenceStatus", "individualCount", "decimalLatitude", "decimalLongitude", "eventDate", "year")))
but I get this error:
Error in fread(select = c("family", "species", "occurrenceStatus", "individualCount", : Input is empty or only contains BOM or terminal control characters
Although it works when removing the select
function, but I wish to only obtain 8/50 columns. Which I can manage afterwards, perhaps within a function but it takes far too much time and memory if I include files >5.
I have also tried:
species <- rbindlist(Map(fread, file = files[1:2],
select = c("family", "species", "occurrenceStatus", "individualCount", "decimalLatitude", "decimalLongitude", "eventDate", "year")))
Which gives this error:
Error in rbindlist(Map(fread, file = flt, select = c("family", "species", : Class attribute on column 1 of item 7 does not match with column 1 of item 1.
Because the column names after the first file have different names as mentioned. Any ideas on how to approach this effectively?
Upvotes: 0
Views: 344
Reputation: 388817
You may try this -
library(data.table)
#define the column names in the data
cols <- c("family", "species", "occurrenceStatus", "individualCount",
"decimalLatitude", "decimalLongitude", "eventDate", "year")
#define the column numbers
col_num <- c(8, 10, 19, 20, 22, 23, 30, 33)
#Read the 1st file with correct column names
data <- fread(files[1], select = col_num)
#Read all the files from 2nd filename without header and
#assign them column names using setNames
#combine the data together with rbindlist
result <- rbindlist(lapply(files[-1], function(x) setNames(
fread(x, select = col_num, header = TRUE), cols)), fill = TRUE)
#Add 1st dataset to rest of them
result <- rbind(data, result)
However, I am skeptical if this will work without giving you any memory error if each of the file is as big as 10GB.
Upvotes: 1