Reputation: 51

How to load multiple csv files with different numbers of columns into an R dataframe?

I have lots of csv files which are mostly the same but some have a different number of columns. I want to load named columns and ignore any others. (all files have a header row with column names)

I am hoping I can do something like this so that when a file has a column called 'Temp' it just gets ignored, but I don't know where to put myCols to get that outcome.

fileList <- dir("thedata")
myCols <- c('Time','Place')
df <- do.call(rbind, lapply(fileList, read.csv))

Upvotes: 2

Answers (3)

G. Grothendieck

Reputation: 270368

1) data.table with xsv Download xsv (releases) and put it on your path and then if Filenames is a character vector with the filenames run the following code. The file column in the result will contain the filename that each row originally came from. If you don't need that the last line can be just rbindlist(L).

library(data.table)

L <- lapply(paste("xsv select Time,Place", Filenames), fread)
rbindlist(setNames(L, Filenames), idcol = "file")

2) Base R Create a function, Read, which reads in the headings and creates a colClasses vector which can be used in read.csv to only read in the Time and Place columns -- if an element of colClasses is NULL then that column will be omitted and if an element is NA then the corresonding column will be read in normally. It also adds a file column to record which file each row came from. Omit the data.frame(...) line in Read if you don't need that. Then combine the files using rbind/do.call.

Read <- function(file) {
  cn <- unlist(read.table(file, sep = ",", nrows = 1))
  colClasses <- ifelse(cn %in% v("Time", "Place"), NA, "NULL")
  dat <- read.csv(file, colClasses = colClasses)
  data.frame(file, dat)
}
do.call("rbind", lapply(Filenames, Read))

Upvotes: 0

Mhairi McNeill

Reputation: 1991

Not sure about how you would avoid reading the unwanted columns but the plyr package has a really useful function rbind.fill which joins data frames together row-wise and fills with missing values if the column is not available.

all_data <- Map(read.csv, fileList)
df <- plyr::rbind.fill(all_data)

Then you can drop the columns you do not want from df.

As others have mentioned, to speed up reading the data you might want to use read_csv from readr or fread from data.table. As they are both faster than read.csv.

Upvotes: 0

FAmorim

Reputation: 350

First I created some example data frames to make a list similar to your description.

# Example data frames
df1 <- data.frame(Time = runif(10, 0, 100), Place = letters[1:10])
df2 <- data.frame(Time = runif(10, 0, 100), Place = letters[1:10], Temp = runif(10, 0, 1))
df3 <- data.frame(Time = runif(10, 0, 100), Place = letters[11:20])

# List of data frames
df_l <- list(df1, df2, df3)

Now a possible solution

# Now you can select only the target columns using Map and column names
myCols <- c('Time','Place')
df_l <- Map(function(x){x[,names(x) %in% my_cols]}, df_l)

# now you can simply rbind the data frames (which now all have the same columns)
df <- do.call("rbind", df_l)

head(df)
       Time place
1  6.474527     a
2 78.747293     b
3  9.758012     c
4 79.765140     d
5  9.130967     e
6 34.237917     f

Upvotes: 1

How to load multiple csv files with different numbers of columns into an R dataframe?

Answers (3)

Related Questions