Stackbeans
Stackbeans

Reputation: 279

Read multiple .csv files and assign column headers

I have 70 .csv files at 10GB each, they are all split files from the original file, hence each file after the first has for column names the next row from the last of the first file.

I want to read multiple csv files while assigning the column names of the first file to the next that are read. I have tried vroom and readr however they give the wrong column length, data.table::fread seems to be the only one working however it does not allow the reading of multiple files at once unless within a looping function.

Here is what I have tried for reading multiple files:

lapply(files[1:2] ,fread( select=c("family", "species", "occurrenceStatus", "individualCount", "decimalLatitude", "decimalLongitude", "eventDate", "year")))

but I get this error:

Error in fread(select = c("family", "species", "occurrenceStatus", "individualCount", : Input is empty or only contains BOM or terminal control characters

Although it works when removing the select function, but I wish to only obtain 8/50 columns. Which I can manage afterwards, perhaps within a function but it takes far too much time and memory if I include files >5.

I have also tried:

species <- rbindlist(Map(fread, file = files[1:2],
              select = c("family", "species", "occurrenceStatus", "individualCount", "decimalLatitude", "decimalLongitude", "eventDate", "year")))

Which gives this error:

Error in rbindlist(Map(fread, file = flt, select = c("family", "species", : Class attribute on column 1 of item 7 does not match with column 1 of item 1.

Because the column names after the first file have different names as mentioned. Any ideas on how to approach this effectively?

Upvotes: 0

Views: 344

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 388817

You may try this -

library(data.table)

#define the column names in the data
cols <- c("family", "species", "occurrenceStatus", "individualCount", 
          "decimalLatitude", "decimalLongitude", "eventDate", "year")
#define the column numbers
col_num <- c(8, 10, 19, 20, 22, 23, 30, 33)
#Read the 1st file with correct column names
data <- fread(files[1], select = col_num)
#Read all the files from 2nd filename without header and 
#assign them column names using setNames
#combine the data together with rbindlist
result <- rbindlist(lapply(files[-1], function(x) setNames(
            fread(x, select = col_num, header = TRUE), cols)), fill = TRUE)
#Add 1st dataset to rest of them
result <- rbind(data, result)

However, I am skeptical if this will work without giving you any memory error if each of the file is as big as 10GB.

Upvotes: 1

Related Questions