Reputation: 51
I have lots of csv files which are mostly the same but some have a different number of columns. I want to load named columns and ignore any others. (all files have a header row with column names)
I am hoping I can do something like this so that when a file has a column called 'Temp' it just gets ignored, but I don't know where to put myCols to get that outcome.
fileList <- dir("thedata")
myCols <- c('Time','Place')
df <- do.call(rbind, lapply(fileList, read.csv))
Upvotes: 2
Views: 1290
Reputation: 270368
1) data.table with xsv Download xsv (releases) and put it on your path and then if Filenames is a character vector with the filenames run the following code. The file column in the result will contain the filename that each row originally came from. If you don't need that the last line can be just rbindlist(L).
library(data.table)
L <- lapply(paste("xsv select Time,Place", Filenames), fread)
rbindlist(setNames(L, Filenames), idcol = "file")
2) Base R Create a function, Read, which reads in the headings and creates a colClasses vector which can be used in read.csv to only read in the Time and Place columns -- if an element of colClasses is NULL then that column will be omitted and if an element is NA then the corresonding column will be read in normally. It also adds a file column to record which file each row came from. Omit the data.frame(...) line in Read if you don't need that. Then combine the files using rbind/do.call.
Read <- function(file) {
cn <- unlist(read.table(file, sep = ",", nrows = 1))
colClasses <- ifelse(cn %in% v("Time", "Place"), NA, "NULL")
dat <- read.csv(file, colClasses = colClasses)
data.frame(file, dat)
}
do.call("rbind", lapply(Filenames, Read))
Upvotes: 0
Reputation: 1991
Not sure about how you would avoid reading the unwanted columns but the plyr
package has a really useful function rbind.fill
which joins data frames together row-wise and fills with missing values if the column is not available.
all_data <- Map(read.csv, fileList)
df <- plyr::rbind.fill(all_data)
Then you can drop the columns you do not want from df.
As others have mentioned, to speed up reading the data you might want to use read_csv
from readr
or fread
from data.table
. As they are both faster than read.csv
.
Upvotes: 0
Reputation: 350
First I created some example data frames to make a list similar to your description.
# Example data frames
df1 <- data.frame(Time = runif(10, 0, 100), Place = letters[1:10])
df2 <- data.frame(Time = runif(10, 0, 100), Place = letters[1:10], Temp = runif(10, 0, 1))
df3 <- data.frame(Time = runif(10, 0, 100), Place = letters[11:20])
# List of data frames
df_l <- list(df1, df2, df3)
Now a possible solution
# Now you can select only the target columns using Map and column names
myCols <- c('Time','Place')
df_l <- Map(function(x){x[,names(x) %in% my_cols]}, df_l)
# now you can simply rbind the data frames (which now all have the same columns)
df <- do.call("rbind", df_l)
head(df)
Time place
1 6.474527 a
2 78.747293 b
3 9.758012 c
4 79.765140 d
5 9.130967 e
6 34.237917 f
Upvotes: 1