Reputation: 598
I've been searching through possible solutions, and found something that partially works for my application. I am adding missing data to an existing data frame in "chunks" through a loop. The update has rows that don't exist in the main table.
The problem I am running into is that I need to insert the rows from the second table that don't exist in the first table, and fill in the Value1 column with 0's where the new rows are added.
(Each SubDF is over 50k rows long, and there can be 50 SubDF's to iterate through per MainDF, resulting in a 2.5mil row MainDF)
Current code: (Pardon the loop coding, it's not functional. Just using it for iliustration)
Main_DF
df_list <- c(Sub_DF1, Sub_DF2, SubDF3)
for (i in df_list){
Sub_DF <- i
############## Code in question
setDT(Main_DF)
setDT(Sub_DF)
Main_DF[Sub_DF,
on=c("Path1", "Path2","File_Name", "ID"),
c("value2") := .(i.value2)]
###############
}
Various other permutations I have tried:
#
setDT(Main_DF)
setDT(Sub_DF)
setkeyv(Main_DF, c("Path1", "Path2","File_Name", "ID"))
setkeyv(Sub_DF, c("Path1", "Path2","File_Name", "ID"))
Main_DF<- Main_DF[Sub_DF]
#
Main_DF<- merge(Main_DF, Sub_DF, by = c("Path1", "Path2","File_Name", "ID"),
all = TRUE)
#
Main_DF[Sub_DF, on=c("Path1", "Path2","File_Name", "ID"),
c("Value2") := .(i.Value2)]
#
Main_DF[Sub_DF,]
#
Main_DF<- Main_DF[Sub_DF, on=c("Path1", "Path2","File_Name", "ID")]
#
Main_DF[Sub_DF, on=.("Path1", "Path2","File_Name", "ID"), `:=` (Value2=
i.Value2)]
#
Main_DF <- merge(Main_DF,Sub_DF,by=c("Path1", "Path2","File_Name", "ID"),all
=T, fill.NA = 0)
Main DF
Path1 Path2 File_Name ID Value1
root home Sample1 1 1
root home Sample1 2 0
root home Sample1 7 1
root home Sample2 1 0
root home Sample2 2 1
root home Sample2 3 1
root home Sample2 8 1
root home Sample3 1 0
root home Sample3 2 1
root home Sample3 6 1
Sub DF (1st iteration of loop)
Path1 Path2 File_Name ID Value2
root home Sample1 1 5000
root home Sample1 2 9000
root home Sample1 5 400
root home Sample1 6 3500
root home Sample1 7 8500
root home Sample1 8 2200
Sub DF (2nd iteration of loop)
Path1 Path2 File_Name ID Value2
root home Sample2 1 5000
root home Sample2 2 9000
root home Sample2 3 700
root home Sample2 5 400
root home Sample2 6 3500
root home Sample2 7 8500
root home Sample2 8 2200
Sub DF (3rd iteration of loop)
Path1 Path2 File_Name ID Value2
root home Sample3 1 5000
root home Sample3 2 9000
root home Sample3 5 400
root home Sample3 6 3500
root home Sample3 7 8500
root home Sample3 8 2200
ACTUAL Updated Main DF (After iterating through the 3 SubDF's)
Path1 Path2 File_Name ID Value1 Value2
root home Sample1 1 1 5000
root home Sample1 2 0 9000
root home Sample1 7 1 8500
root home Sample2 1 0 5000
root home Sample2 2 1 9000
root home Sample2 3 1 700
root home Sample2 8 1 8800
root home Sample3 1 0 5000
root home Sample3 2 1 9000
root home Sample3 6 1 3500
DESIRED Updated Main DF
Path1 Path2 File_Name ID Value1 Value2
root home Sample1 1 1 5000
root home Sample1 2 0 9000
root home Sample1 5 1 400
root home Sample1 6 1 3500
root home Sample1 7 0 8500
root home Sample1 8 0 2200
root home Sample2 1 0 5000
root home Sample2 2 1 9000
root home Sample2 3 1 700
root home Sample2 5 0 400
root home Sample2 6 0 3500
root home Sample2 7 0 8500
root home Sample2 8 1 2200
root home Sample3 1 0 5000
root home Sample3 2 1 9000
root home Sample3 5 0 400
root home Sample3 6 1 3500
root home Sample3 7 0 8500
root home Sample3 8 0 2200
Upvotes: 2
Views: 1550
Reputation: 598
So it wasn't the implementation I was looking for, but @chinsoon12 solution works in a different way.
In the loop that processes the SubDF's, i assign the rbind to a new dataframe. I then push that smaller dataframe to my database before the loop recycles, overwriting the dataframe with the next SubDF data.
This was one of the avenues I had planned on looking at if I couldn't get the data.table join to work the way I was hoping.
Thanks for the help!
Upvotes: 0
Reputation: 25225
If I am not mistaken, this should be what you are looking for:
rbindlist(list(Sub_DF1, Sub_DF2, Sub_DF3))[
Main_DF, on=.(Path1, Path2, File_Name, ID), Value1 := i.Value1][
is.na(Value1), Value1 := 0L][]
output:
Path1 Path2 File_Name ID Value2 Value1
1: root home Sample1 1 5000 1
2: root home Sample1 2 9000 0
3: root home Sample1 5 400 0
4: root home Sample1 6 3500 0
5: root home Sample1 7 8500 1
6: root home Sample1 8 2200 0
7: root home Sample2 1 5000 0
8: root home Sample2 2 9000 1
9: root home Sample2 3 700 1
10: root home Sample2 5 400 0
11: root home Sample2 6 3500 0
12: root home Sample2 7 8500 0
13: root home Sample2 8 2200 1
14: root home Sample3 1 5000 0
15: root home Sample3 2 9000 1
16: root home Sample3 5 400 0
17: root home Sample3 6 3500 1
18: root home Sample3 7 8500 0
19: root home Sample3 8 2200 0
data:
library(data.table)
Main_DF <- fread("Path1 Path2 File_Name ID Value1
root home Sample1 1 1
root home Sample1 2 0
root home Sample1 7 1
root home Sample2 1 0
root home Sample2 2 1
root home Sample2 3 1
root home Sample2 8 1
root home Sample3 1 0
root home Sample3 2 1
root home Sample3 6 1")
Sub_DF1 <- fread("Path1 Path2 File_Name ID Value2
root home Sample1 1 5000
root home Sample1 2 9000
root home Sample1 5 400
root home Sample1 6 3500
root home Sample1 7 8500
root home Sample1 8 2200")
Sub_DF2 <- fread("Path1 Path2 File_Name ID Value2
root home Sample2 1 5000
root home Sample2 2 9000
root home Sample2 3 700
root home Sample2 5 400
root home Sample2 6 3500
root home Sample2 7 8500
root home Sample2 8 2200")
Sub_DF3 <- fread("Path1 Path2 File_Name ID Value2
root home Sample3 1 5000
root home Sample3 2 9000
root home Sample3 5 400
root home Sample3 6 3500
root home Sample3 7 8500
root home Sample3 8 2200")
Upvotes: 2