Reputation: 5184

How to full outer join multiple csv files into a single data.table in R?

I have been using python for a while now to work with pandas dataframes. I want to switch the same code I am using over to R. However, I don't have much experience with R and I'm not sure what options I have to do the same thing. I have a folder of many csv files, and I have a list of the file names which I would like to iterate over and do a full outer join of the files.

In pandas, I would run the following command,

import pandas as pd
filelist = pd.read_excel("/Users/XXX/Documents/test/data/list.xlsx") #contains a list of filenames in the File column around 8000 rows long

workingdf = pd.DataFrame() #create a blank data frame

for subdir in filelist.File:
    df = pd.read_csv(f"/Users/XXX/Documents/test/data/{subdir}",index_col=0) #open file 2 columns named DATE and VALUE
    df = df.rename(columns={"VALUE":subdir}) #rename the value column to the file name
    workingdf = pd.merge(workingdf,df,how='outer',left_index=True,right_index=True)

The last part of the code merges all the dataframes in the dictionary into one big data frame using a full outer join.

I want to take a bunch of csv files I have in a folder, iterate over a list of file names, create data tables out of the files, and append them one by one to make one big data table in R.

I am trying to switch to R because I am running into memory and efficiency problems and someone suggested I give it a try.

How can I run the same instruction set in R? I don't have much experience with it but I have the following code so far.

library(openxlsx)
library(data.table)

filelist <- openxlsx::read.xlsx("/Users/XXX/Documents/test/data/list.xlsx")


for (row in filelist[1]){

}

I know I need to use tempdf = read.csv(paste("/Users/XXX/Documents/test/data/",row,sep"")) to create the path to the file and make the frame.

The renaming function should be along the lines of names(tempdf)[2]<-row

I also know that I have to use workingdf <- merge(workingdf,tempdf,by="DATE",all=TRUE)

But I don't really know where to start on this because I can't create a blank dataframe like in pandas and just start merging them. Any suggestions?

Upvotes: 2

Answers (2)

Len Greski

Reputation: 10855

One can use R functions such as paste0() or sprintf() to build the full path file name for a list of files to be read into R. Here is an example that downloads spreadsheets for the first seven generations of Pokémon, unzips the file, and uses sprintf() to build the path names to read each file.

download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonXLSX.zip",
               "PokemonXLSX.zip",
               method="curl",mode="wb")
unzip("PokemonXLSX.zip")
library(readxl)
# create a set of numbers to be used to generate file names
# that have leading zeros, e.g. gen01.xlsx, gen02.xlsx, etc. 
generationIds <- 1:7
spreadsheets <- lapply(generationIds,function(x) {
     # use generation number to create individual file name
     aFile <- sprintf("./PokemonData/gen%02i.xlsx",x)
     data <- read_excel(aFile)
     })

In order to merge the files by an ID variable, one can use the Reduce() function as illustrated in another answer.

The following code illustrates how the file names are created.

theFiles <- lapply(generationIds,function(x) {
        # use generation number to create individual file name
        aFile <- sprintf("./PokemonData/gen%02i.xlsx",x)
        message(paste("current file is: ",aFile))
        aFile
})

Upvotes: 1

chinsoon12

Reputation: 25225

An example of full outer join on a list of csvs

library(data.table)
#get list of csv in current folder
l <- list.files(pattern="(.*).csv")

#use data.table::fread to read them and then merge with all=TRUE for full outer join
#Reduce is a recursive function takes prev output to be merged with next input
Reduce(function(x, y) merge(x, y, by="DATE", all=TRUE), lapply(l, fread))

Upvotes: 2

How to full outer join multiple csv files into a single data.table in R?

Answers (2)

Related Questions