Steve
Steve

Reputation: 598

R repeatedly update dataframe with results from for loop

I've been searching through possible solutions, and found something that partially works for my application. I am adding missing data to an existing data frame in "chunks" through a loop. The update has rows that don't exist in the main table.

The problem I am running into is that I need to insert the rows from the second table that don't exist in the first table, and fill in the Value1 column with 0's where the new rows are added.

(Each SubDF is over 50k rows long, and there can be 50 SubDF's to iterate through per MainDF, resulting in a 2.5mil row MainDF)

Current code: (Pardon the loop coding, it's not functional. Just using it for iliustration)

Main_DF
df_list <- c(Sub_DF1, Sub_DF2, SubDF3)

for (i in df_list){

Sub_DF <- i

############## Code in question
setDT(Main_DF)
setDT(Sub_DF)

Main_DF[Sub_DF, 
   on=c("Path1", "Path2","File_Name", "ID"),  
   c("value2") := .(i.value2)] 
###############
}

Various other permutations I have tried:

#
setDT(Main_DF)
setDT(Sub_DF)
setkeyv(Main_DF, c("Path1", "Path2","File_Name", "ID"))
setkeyv(Sub_DF, c("Path1", "Path2","File_Name", "ID"))
Main_DF<- Main_DF[Sub_DF]

#    
Main_DF<- merge(Main_DF, Sub_DF, by = c("Path1", "Path2","File_Name", "ID"), 
all = TRUE)

#
Main_DF[Sub_DF, on=c("Path1", "Path2","File_Name", "ID"),  
           c("Value2") := .(i.Value2)] 

# 
Main_DF[Sub_DF,]

#    
Main_DF<- Main_DF[Sub_DF, on=c("Path1", "Path2","File_Name", "ID")]

#    
Main_DF[Sub_DF, on=.("Path1", "Path2","File_Name", "ID"),  `:=` (Value2= 
i.Value2)] 

#
Main_DF <- merge(Main_DF,Sub_DF,by=c("Path1", "Path2","File_Name", "ID"),all 
=T, fill.NA = 0)

Main DF

Path1   Path2   File_Name   ID   Value1
root    home    Sample1     1    1
root    home    Sample1     2    0
root    home    Sample1     7    1
root    home    Sample2     1    0
root    home    Sample2     2    1
root    home    Sample2     3    1
root    home    Sample2     8    1
root    home    Sample3     1    0
root    home    Sample3     2    1
root    home    Sample3     6    1

Sub DF (1st iteration of loop)

Path1   Path2   File_Name  ID  Value2
root    home    Sample1    1   5000
root    home    Sample1    2   9000
root    home    Sample1    5   400
root    home    Sample1    6   3500
root    home    Sample1    7   8500
root    home    Sample1    8   2200

Sub DF (2nd iteration of loop)

Path1   Path2   File_Name  ID  Value2
root    home    Sample2    1   5000
root    home    Sample2    2   9000
root    home    Sample2    3   700
root    home    Sample2    5   400
root    home    Sample2    6   3500
root    home    Sample2    7   8500
root    home    Sample2    8   2200

Sub DF (3rd iteration of loop)

Path1   Path2   File_Name  ID  Value2
root    home    Sample3    1   5000
root    home    Sample3    2   9000
root    home    Sample3    5   400
root    home    Sample3    6   3500
root    home    Sample3    7   8500
root    home    Sample3    8   2200

ACTUAL Updated Main DF (After iterating through the 3 SubDF's)

Path1   Path2   File_Name   ID   Value1   Value2
root    home    Sample1     1    1        5000
root    home    Sample1     2    0        9000
root    home    Sample1     7    1        8500
root    home    Sample2     1    0        5000   
root    home    Sample2     2    1        9000
root    home    Sample2     3    1        700 
root    home    Sample2     8    1        8800
root    home    Sample3     1    0        5000 
root    home    Sample3     2    1        9000
root    home    Sample3     6    1        3500

DESIRED Updated Main DF

Path1   Path2   File_Name   ID   Value1   Value2
root    home    Sample1     1    1        5000
root    home    Sample1     2    0        9000
root    home    Sample1     5    1        400
root    home    Sample1     6    1        3500
root    home    Sample1     7    0        8500
root    home    Sample1     8    0        2200
root    home    Sample2     1    0        5000
root    home    Sample2     2    1        9000
root    home    Sample2     3    1        700
root    home    Sample2     5    0        400
root    home    Sample2     6    0        3500
root    home    Sample2     7    0        8500
root    home    Sample2     8    1        2200
root    home    Sample3     1    0        5000
root    home    Sample3     2    1        9000
root    home    Sample3     5    0        400
root    home    Sample3     6    1        3500
root    home    Sample3     7    0        8500
root    home    Sample3     8    0        2200 

Upvotes: 2

Views: 1550

Answers (2)

Steve
Steve

Reputation: 598

So it wasn't the implementation I was looking for, but @chinsoon12 solution works in a different way.

In the loop that processes the SubDF's, i assign the rbind to a new dataframe. I then push that smaller dataframe to my database before the loop recycles, overwriting the dataframe with the next SubDF data.

This was one of the avenues I had planned on looking at if I couldn't get the data.table join to work the way I was hoping.

Thanks for the help!

Upvotes: 0

chinsoon12
chinsoon12

Reputation: 25225

If I am not mistaken, this should be what you are looking for:

rbindlist(list(Sub_DF1, Sub_DF2, Sub_DF3))[
    Main_DF, on=.(Path1, Path2, File_Name, ID), Value1 := i.Value1][
        is.na(Value1), Value1 := 0L][]

output:

    Path1 Path2 File_Name ID Value2 Value1
 1:  root  home   Sample1  1   5000      1
 2:  root  home   Sample1  2   9000      0
 3:  root  home   Sample1  5    400      0
 4:  root  home   Sample1  6   3500      0
 5:  root  home   Sample1  7   8500      1
 6:  root  home   Sample1  8   2200      0
 7:  root  home   Sample2  1   5000      0
 8:  root  home   Sample2  2   9000      1
 9:  root  home   Sample2  3    700      1
10:  root  home   Sample2  5    400      0
11:  root  home   Sample2  6   3500      0
12:  root  home   Sample2  7   8500      0
13:  root  home   Sample2  8   2200      1
14:  root  home   Sample3  1   5000      0
15:  root  home   Sample3  2   9000      1
16:  root  home   Sample3  5    400      0
17:  root  home   Sample3  6   3500      1
18:  root  home   Sample3  7   8500      0
19:  root  home   Sample3  8   2200      0

data:

library(data.table)
Main_DF <- fread("Path1   Path2   File_Name   ID   Value1
root    home    Sample1     1    1
root    home    Sample1     2    0
root    home    Sample1     7    1
root    home    Sample2     1    0
root    home    Sample2     2    1
root    home    Sample2     3    1
root    home    Sample2     8    1
root    home    Sample3     1    0
root    home    Sample3     2    1
root    home    Sample3     6    1")

Sub_DF1 <- fread("Path1   Path2   File_Name  ID  Value2
root    home    Sample1    1   5000
root    home    Sample1    2   9000
root    home    Sample1    5   400
root    home    Sample1    6   3500
root    home    Sample1    7   8500
root    home    Sample1    8   2200")

Sub_DF2 <- fread("Path1   Path2   File_Name  ID  Value2
root    home    Sample2    1   5000
root    home    Sample2    2   9000
root    home    Sample2    3   700
root    home    Sample2    5   400
root    home    Sample2    6   3500
root    home    Sample2    7   8500
root    home    Sample2    8   2200")

Sub_DF3 <- fread("Path1   Path2   File_Name  ID  Value2
root    home    Sample3    1   5000
root    home    Sample3    2   9000
root    home    Sample3    5   400
root    home    Sample3    6   3500
root    home    Sample3    7   8500
root    home    Sample3    8   2200")

Upvotes: 2

Related Questions