research111

Reputation: 357

Merge in a loop in R

I am using a for loop to merge multiple files with another file:

files <- list.files("path", pattern = "\\.TXT$", ignore.case = TRUE)

library(data.table)  # for fread()

for (i in seq_along(files)) {
  data <- fread(files[i], header = TRUE)

  # Left-join the current file onto mydata by ID
  mydata <- merge(mydata, data, by = "ID", all.x = TRUE)

  rm(data)
}

"mydata" looks as follows (simplified):

ID  x1  x2
1   2   8
2   5   5
3   4   4
4   6   5
5   5   8

"data" looks as follows (around 600 files, 100GB in total). Example of 2 (separate) files. Integrating all into 1 file would be impossible (too large):

ID  x3
1   8
2   4

ID  x3
3   4
4   5
5   1

When I run my code I get the following dataset:

ID  x1  x2  x3.x    x3.y
1   2   8   8       NA
2   5   5   4       NA
3   4   4   NA      4
4   6   5   NA      5
5   5   8   NA      1

What I would like to get is:

ID  x1  x2  x3
1   2   8   8
2   5   5   4
3   4   4   4
4   6   5   5
5   5   8   1

ID's are unique (never duplicates over the 600 files).

Any idea on how to achieve this as efficiently as possible would be much appreciated.

Upvotes: 1

Views: 9977

Answers (1)

Jav

Reputation: 2313

This would be better suited as a comment, but I can't comment yet.

Would it not be better to rbind instead of merge? That seems to be what you want to accomplish.

Set the fill argument to TRUE to take care of differing column counts:

library(data.table)

asd <- data.table(x1 = c(1, 2), x2 = c(4, 5))
a <- data.table(x2 = 5)
rbind(asd, a, fill = TRUE)

   x1 x2
1:  1  4
2:  2  5
3: NA  5

Do this with data and then merge into mydata by ID.
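Applied to the question's example tables, the rbind-then-merge route looks like this (plain data.frames here for brevity; the same calls work on data.tables):

```r
mydata <- data.frame(ID = 1:5,
                     x1 = c(2, 5, 4, 6, 5),
                     x2 = c(8, 5, 4, 5, 8))
file1 <- data.frame(ID = 1:2, x3 = c(8, 4))
file2 <- data.frame(ID = 3:5, x3 = c(4, 5, 1))

# Stack the files first, then merge once -- no clashing x3 columns
binded <- rbind(file1, file2)
merge(mydata, binded, by = "ID", all.x = TRUE)
#   ID x1 x2 x3
# 1  1  2  8  8
# 2  2  5  5  4
# 3  3  4  4  4
# 4  4  6  5  5
# 5  5  5  8  1
```

Because each ID occurs in exactly one file, the single merge fills x3 for every row without producing x3.x/x3.y suffixes.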

Update for comment

files <- list.files("path", pattern = "\\.TXT$", ignore.case = TRUE)

ff <- function(input) {
  fread(input)
}

a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))

So, this creates a function to read the files and passes it to lapply, giving you a list containing all your data files, each in its own data frame.

ldply from plyr then rbinds all the data frames into one.

Don't touch mydata yet.

binded.data <- data.table(binded.data, key = "ID")

Depending on your mydata, you will perform different merge commands. See: https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html

Update 2

files <- list.files("path", pattern = "\\.TXT$", ignore.case = TRUE)

ff <- function(input) {
  data <- fread(input)
  # Keep only the rows of 'data' whose ID appears in 'mydata'
  data[ID %in% mydata[, ID]]
}

a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))

Update 3

You can add cat to see which file the function is reading at the moment, so you can tell after which file you run out of memory. That will point you toward how many files you can read in one go.

ff <- function(input) {
  # Print the name of the file being read now
  cat(input, "\n")
  data <- fread(input)
  # Keep only the rows of 'data' whose ID appears in 'mydata'
  data[ID %in% mydata[, ID]]
}
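Putting the updates together, a minimal end-to-end sketch (base read.table is used here so it runs anywhere; substitute fread for the real 100GB case, and data.table's rbindlist(..., fill = TRUE) if the files' columns differ):

```r
merge_files <- function(files, mydata) {
  pieces <- lapply(files, function(f) {
    cat(f, "\n")                     # show which file is being read now
    d <- read.table(f, header = TRUE)
    d[d$ID %in% mydata$ID, ]         # keep only IDs present in mydata
  })
  binded <- do.call(rbind, pieces)   # stack all the filtered files
  merge(mydata, binded, by = "ID", all.x = TRUE)  # single merge at the end
}
```

Since the IDs are unique across the 600 files, each ID picks up its x3 value from exactly one file, so one left merge at the end reproduces the desired table.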

Upvotes: 6
