Reputation: 357
I am using a for loop to merge multiple files with another file:
files <- list.files("path", pattern=".TXT", ignore.case=T)
for(i in 1:length(files))
{
data <- fread(files[i], header=T)
# Merge
mydata <- merge(mydata, data, by="ID", all.x=TRUE)
rm(data)
}
"mydata" looks as follows (simplified):
ID x1 x2
1 2 8
2 5 5
3 4 4
4 6 5
5 5 8
"data" looks as follows (around 600 files, in total 100GB). Example of 2 (seperate) files. Integrating all in 1 would be impossible (too large):
ID x3
1 8
2 4
ID x3
3 4
4 5
5 1
When I run my code I get the following dataset:
ID x1 x2 x3.x x3.y
1 2 8 8 NA
2 5 5 4 NA
3 4 4 NA 4
4 6 5 NA 5
5 5 8 NA 1
What I would like to get is:
ID x1 x2 x3
1 2 8 8
2 5 5 4
3 4 4 4
4 6 5 5
5 5 8 1
ID's are unique (never duplicates over the 600 files).
Any idea on how to achieve this as efficiently as possible much appreciated.
Upvotes: 1
Views: 9977
Reputation: 2313
It's better suited as comment, But I can't comment yet.
Would it not be better to rbind instead of merge? This seems to be what you want to acomplish.
Set fill
argument TRUE
to take care of different column numbers:
asd <- data.table(x1 = c(1, 2), x2 = c(4, 5))
a <- data.table(x2 = 5)
rbind(asd, a, fill = TRUE)
x1 x2
1: 1 4
2: 2 5
3: NA 5
Do this with data
and then merge into mydata
by ID
.
Update for comment
files <- list.files("path", pattern=".TXT", ignore.case=T)
ff <- function(input){
data <- fread(input)
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
So, this creates a function to read files and pushes it to lapply
, so you will get a list containing all your data
files, each on its own dataframe.
With ldply
from plyr
rbind all dataframes into one dataframe.
Don't touch mydata yet.
binded.data <- data.table(binded.data, key = ID)
Depending on your mydata
you will perform different merge
commands.
See:
https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html
Update 2
files <- list.files("path", pattern=".TXT", ignore.case=T)
ff <- function(input){
data <- fread(input)
# This keeps only the rows of 'data' whose ID matches ID of 'mydata'
data <- data[ID %in% mydata[, ID]]
}
a <- lapply(files, ff)
library(plyr)
binded.data <- ldply(a, function(x) rbind(x, fill = TRUE))
Update 3
You can add cat
to see the file the function is reading right now. So you can see after which file you are running out of memory. Which will point you to the direction on how many files you can read in one go.
ff <- function(input){
# This will print name of the file it is reading now
cat(input, "\n")
data <- fread(input)
# This keeps only the rows of 'data' whose ID matches ID of 'mydata'
data <- data[ID %in% mydata[, ID]]
}
Upvotes: 6