gmk

Reputation: 179

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second, similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, row-wise. The datasets have a "key" column, and there will be only one row per unique key value in any given dataset.

The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means the same value of the "key" column and the same column name.)

Note that the update dataset will likely contain new rows ("inserts"), which I can handle with an rbind.

So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:

  K P1 P2 P3
1 A  1  1  1
2 B  1  1  1
3 C  1  1  1

...and the update data frame "df2":

  K P1 P2 P3
1 B  2 NA  2
2 C NA  2  2
3 D  2  2  2

The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:

  K P1 P2 P3
1 A  1  1  1
2 B  2  1  2
3 C  1  2  2
4 D  2  2  2

This doesn't seem to be a job for merge, since merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I would then have to iterate over to collapse somehow.
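For example, a full outer merge on "K" just produces suffixed copies of every overlapping column:

merge(df1, df2, by="K", all=TRUE)
#   K P1.x P2.x P3.x P1.y P2.y P3.y
# 1 A    1    1    1   NA   NA   NA
# 2 B    1    1    1    2   NA    2
# 3 C    1    1    1   NA    2    2
# 4 D   NA   NA   NA    2    2    2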

I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
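For reference, here is a stripped-down sketch of the kind of cell-by-cell loop I mean (not my exact code, and ignoring the inserts):

# walk the overlapping rows of df2 and copy each non-NA cell
# into the matching row/column of a pre-allocated matrix
m <- as.matrix(df1[, -1])
rownames(m) <- df1$K
for (k in intersect(df2$K, df1$K)) {
    for (col in names(df2)[-1]) {
        v <- df2[df2$K == k, col]
        if (!is.na(v)) m[k, col] <- v   # one "cell" per iteration
    }
}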

I'm sure I'm missing something, but can't find a comparable example.

I've seen ddply usage that looks close, but not a general solution. The data.table package didn't seem to help either, as it's not obvious to me that this is a join problem, at least not generally over so many columns.

Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
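For instance, the inserts can be picked off with something like this, once the intersecting rows have been updated in place:

new.rows <- df2[!(df2$K %in% df1$K), ]  # keys present only in the update dataset
df1 <- rbind(df1, new.rows)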

Here is some code to fabricate the data frames above:

cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);

Thanks

Upvotes: 6

Views: 497

Answers (4)

dnlbrky

Reputation: 9805

The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.

library(data.table)
library(reshape2)

cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")

# Read f1.dat, melt to long/stacked format, and convert to a keyed data.table
dt1s <- data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable"))

# Read f2.dat, melt to long/stacked format (dropping the NAs), and convert likewise
dt2s <- data.table(melt(fread("f2.dat"), id.vars="K", na.rm=TRUE), key=c("K","variable"))
setnames(dt2s, "value", "value.new")

# Update the existing cells by joining on (K, variable)
dt1s[dt2s, value:=value.new]

# Use rbindlist to insert the new records, then reshape back to wide format
dtout <- reshape(rbindlist(list(dt1s, dt1s[dt2s][is.na(value), list(K, variable, value=value.new)])),
                 direction="wide", idvar="K", timevar="variable")
setkey(dtout, K)

# Clean up the column names ("value.P1" -> "P1", etc.)
setnames(dtout, colnames(dtout), sub("value.", "", colnames(dtout)))
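On the small example this should come out as:

dtout
#    K P1 P2 P3
# 1: A  1  1  1
# 2: B  2  1  2
# 3: C  1  2  2
# 4: D  2  2  2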

Upvotes: 0

Matt Dowle

Reputation: 59602

EDIT: Please ignore this answer. It's a bad idea to loop by row: it works, but is very slow. Left for posterity! See my 2nd attempt as a separate answer.

require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
setkey(dt1, K)                           # the J() join below needs dt1 keyed
K = dt2[[1]]
for (i in 1:nrow(dt2)) {                 # loop over the update rows (this is the slow part)
    k = K[i]
    p = unlist(dt2[i, -1, with=FALSE])   # this row's values as a named vector
    p = p[!is.na(p)]                     # drop the NAs; only real updates remain
    dt1[J(k), names(p):=as.list(p), with=FALSE]  # assign those columns in dt1's matching row
}

Or, can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
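A sketch of that idea, assuming the value columns share a single type (a matrix can hold only one) and using df1/df2 from the question:

A <- as.matrix(df1[, -1])                # base values as a keyed matrix
rownames(A) <- df1$K
upd <- as.matrix(df2[, -1])
rownames(upd) <- df2$K
u <- upd[intersect(rownames(upd), rownames(A)), , drop=FALSE]   # intersecting keys only
ij <- which(!is.na(u), arr.ind=TRUE)     # row/col numbers of the non-NA updates
B <- cbind(match(rownames(u)[ij[,1]], rownames(A)), ij[,2])     # translate rows to A's numbering
A[B] <- u[ij]                            # the single A[B] <- value assignment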

Upvotes: 1

Matt Dowle

Reputation: 59602

This loops by column, setting dt1 by reference, and (hopefully) it should be quick.

library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1), names(dt2)))
    stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)                      # each update row's position in dt1 (NA = new key)
for (i in 2:ncol(dt2)) {
    nna = which(!is.na(dt2[[i]]) & !is.na(w))  # non-NA updates that land on existing rows
    set(dt1, w[nna], i, dt2[[i]][nna])         # update those cells by reference
}
dt1 = rbind(dt1, dt2[is.na(w)])                # append the genuinely new rows
dt1
dt1
     K P1 P2 P3
[1,] A  1  1  1
[2,] B  2  1  2
[3,] C  1  2  2
[4,] D  2  2  2

Upvotes: 2

Tyler Rinker

Reputation: 109864

This is likely not the fastest solution, but it is done entirely in base R.

(updated answer per Tommy's comments)

#READING IN YOUR DATA FRAMES
df1 <- read.table(text="  K P1 P2 P3
1 A  1  1  1
2 B  1  1  1
3 C  1  1  1", header=TRUE)

df2 <- read.table(text="  K P1 P2 P3
1 B  2 NA  2
2 C NA  2  2
3 D  2  2  2", header=TRUE)

all <- c(levels(df1$K), levels(df2$K))                  #all values of the key column
dups <- all[duplicated(all)]                            #the overlapping keys
ndups <- all[!all %in% dups]                            #the unique keys
df3 <- rbind(df1[df1$K %in% ndups, ], df2[df2$K %in% ndups, ]) #bind the non-overlapping rows

decider <- function(x, y) ifelse(is.na(x), y, x) #replace NAs in x with the value from y, if any
df4 <- data.frame(mapply(df2[df2$K %in% dups, ], df1[df1$K %in% dups, ],
    FUN = decider)) #replace all NAs of df2 with df1 values where they exist

df5 <- rbind(df3, df4) #bind the unique rows of df1 and df2 with the NA-replaced df4
df5 <- df5[order(df5$K), ]  #reorder based on the key column
rownames(df5) <- 1:nrow(df5)  #give proper non-duplicated rownames
df5
df5

This yields:

  K P1 P2 P3
1 A  1  1  1
2 B  2  1  2
3 C  1  2  2
4 D  2  2  2

Upon closer reading, not all columns have the same name, but I am assuming they are in the same order. This may be a more helpful approach:

all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
        colnames(LS[[i]]) <- colnames(LS[[2]])  #assume df2's column names for both
        return(LS[[i]])
    }
)

LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K %in% ndups, ])  #non-overlapping rows
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K %in% dups, ])   #overlapping rows

decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))  #fill df2's NAs from df1
DF$K <- LS4[[1]]$K  #restore the key column (ifelse drops the factor class)
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5

Upvotes: 2
