Reputation: 615

What is the least memory-demanding method to do these steps?

I posted a question yesterday and got wonderful responses from the experts. However, I now face another problem: the suggested solutions cannot be run on my real data because my starting file (df1) is too large. Is there a faster method to do the same job without using adply or a for loop?

My original question is given below:

Step 1: I have a simplified dataframe like this:

df1 = data.frame (B=c(1,0,1), C=c(1,1,0)
  , D=c(1,0,1), E=c(1,1,0), F=c(0,0,1)
  , G=c(0,1,0), H=c(0,0,1), I=c(0,1,0))

  B C D E F G H I
1 1 1 1 1 0 0 0 0
2 0 1 0 1 0 1 0 1
3 1 0 1 0 1 0 1 0

Step 2: I want to do row wise subtraction, i.e. (row1 - row2), (row1 - row3) and (row2 - row3)

row1-row2    1  0    1  0    0  -1   0  -1
row1-row3    0  1    0  1   -1   0  -1   0
row2-row3   -1  1   -1  1   -1   1  -1   1

Step 3: replace all -1 with 0

row1-row2   1   0   1   0   0   0   0   0
row1-row3   0   1   0   1   0   0   0   0
row2-row3   0   1   0   1   0   1   0   1

Could you show me how to do this with a less memory-demanding approach?

Upvotes: 1

Answers (3)

Martin Morgan

Reputation: 46856

Since the data is homogeneous, use a matrix representation. Organize it so that the 'rows' are columns, as

m <- t(as.matrix(df1))
mode(m) <- "integer"  # maybe already true?

Pre-allocate the space for the answer:

n <- ncol(m) - 1
ans <- matrix(0L, nrow(m), (n+1) * n / 2)
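The column count (n+1) * n / 2 is the number of unordered pairs of input rows, so the answer grows quadratically with the number of rows. A rough sketch of how one might estimate its size before allocating, assuming 4 bytes per integer and 1 byte per raw value; the 10,000-row, 8-column input is a hypothetical example, not the asker's data:

```r
# The pre-allocated column count equals choose(N, 2) for N = n + 1 input rows
N <- 3L
n <- N - 1L
stopifnot((n + 1) * n / 2 == choose(N, 2))

# Hypothetical larger input: 10,000 rows and 8 columns.
# Doubles are used here so the products cannot overflow integer range.
N_big <- 1e4
p <- 8
pairs <- choose(N_big, 2)       # number of row pairs
bytes_int <- pairs * p * 4      # integer storage, 4 bytes per value
bytes_raw <- pairs * p * 1      # raw storage, 1 byte per value
gib_int <- bytes_int / 2^30     # roughly 1.5 GiB
gib_raw <- bytes_raw / 2^30     # roughly 0.37 GiB
```

If the estimate exceeds available memory, no in-memory representation of the full result will fit, whichever method computes it.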

We want to compare column 1 to columns 1:n + 1L (the L suffix makes the number one an integer value rather than a real). This is m[, 1] - m[, 1:n + 1L], using R's recycling. We then iterate over the columns, with idx and off keeping track of the columns we want to compare against and of where the results are placed in the answer:

off <- 0
for (i in 1:n) {
    idx <- i:n + 1L
    ans[, off + seq_along(idx)] <- m[, i] - m[, idx]
    off <- off + length(idx)
}

The final step is

ans[ans<0L] <- 0L
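The steps above can be collected into one function and checked against the output requested in the question. This is only a sketch: the name pairwise_pos_diff is made up for illustration, and the final t() (which returns one row per pair, as in the question) creates another copy of the result, so it could be dropped when memory is tight.

```r
df1 <- data.frame(B = c(1, 0, 1), C = c(1, 1, 0), D = c(1, 0, 1),
                  E = c(1, 1, 0), F = c(0, 0, 1), G = c(0, 1, 0),
                  H = c(0, 0, 1), I = c(0, 1, 0))

# Hypothetical wrapper around the loop shown above
pairwise_pos_diff <- function(df) {
    m <- t(as.matrix(df))            # one column per original row
    mode(m) <- "integer"
    n <- ncol(m) - 1L
    ans <- matrix(0L, nrow(m), (n + 1L) * n / 2L)
    off <- 0L
    for (i in seq_len(n)) {          # seq_len() also copes with a one-row input
        idx <- i:n + 1L
        ans[, off + seq_along(idx)] <- m[, i] - m[, idx]
        off <- off + length(idx)
    }
    ans[ans < 0L] <- 0L              # step 3: negative differences become 0
    t(ans)                           # back to one row per pair
}

res <- pairwise_pos_diff(df1)
expected <- rbind(c(1, 0, 1, 0, 0, 0, 0, 0),   # row1 - row2
                  c(0, 1, 0, 1, 0, 0, 0, 0),   # row1 - row3
                  c(0, 1, 0, 1, 0, 1, 0, 1))   # row2 - row3
```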

There may be additional efficiencies from noticing that, under the original operation, the result is 0 unless m[, 1] == 1 & m[, 1:n + 1L] == 0. Likewise, if space were a serious issue, the data could be represented as mode(m) <- "raw" and the arithmetic operations replaced with the comparison just suggested, along the lines of:

m <- t(as.matrix(df1))
mode(m) <- "raw"
n <- ncol(m) - 1  # as before

off <- 0
x0 <- as.raw(0); x1 <- as.raw(1)
ans <- matrix(raw(), nrow(m), (n+1) * n / 2)
for (i in 1:n) {
    idx <- i:n + 1L
    updt <- which((m[, i] == x1) & (m[, idx] == x0))
    ans[off + updt] <- x1
    off <- off + length(idx) * nrow(ans)
}
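The equivalence that the raw version relies on can be checked on all four combinations of 0/1 inputs. A minimal sketch, assuming the data really contain only 0 and 1:

```r
# All four combinations of a (minuend) and b (subtrahend)
a <- c(0L, 0L, 1L, 1L)
b <- c(0L, 1L, 0L, 1L)

# Arithmetic route: subtract, then replace negatives with 0
arith <- pmax(a - b, 0L)

# Comparison route: the result is 1 only when a is 1 and b is 0
logic <- as.integer(a == 1L & b == 0L)

# The same comparison carried out on raw vectors
ra <- as.raw(a)
rb <- as.raw(b)
raw_res <- as.raw(ra == as.raw(1) & rb == as.raw(0))

stopifnot(identical(arith, logic))
stopifnot(identical(as.integer(raw_res), arith))
```

If other values can occur in the real data, the comparison route no longer matches the subtraction and the integer version should be used instead.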

Upvotes: 3

G. Grothendieck

Reputation: 269441

Using the sqldf package or RSQLite directly would allow one to do this with all computations done outside of R so that there would be no intermediate storage required. We illustrate using sqldf. See the sqldf home page for more info.

Alternative 1. In this approach, note that we use dbname = tempfile() so that all computations are performed in an external database (which is created on the fly and automatically deleted) rather than in memory.

library(sqldf)
gc()
DF <- sqldf("select x.rowid x, y.rowid y,
    max(x.B - y.B, 0) B, max(x.C - y.C, 0) C, 
    max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
    max(x.F - y.F, 0) F, max(x.G - y.G, 0) G, 
    max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
    from df1 x, df1 y
    where x.rowid > y.rowid", dbname = tempfile())

This would only require that we are able to store df1 and DF in our workspace.

Alternative 2. If even that overflows we can write out df1, remove it, perform the calculation below and then we would only need sufficient storage to store the result, DF.

read.csv.sql uses dbname = tempfile() by default so in this case we do not need to specify it.

write.table(df1, "data.txt", sep = ",", quote = FALSE)
rm(df1)
gc()
DF <- read.csv.sql("data.txt", sql = "select
    x.rowid x, y.rowid y, 
    max(x.B - y.B, 0) B, max(x.C - y.C, 0) C, 
    max(x.D - y.D, 0) D, max(x.E - y.E, 0) E,
    max(x.F - y.F, 0) F, max(x.G - y.G, 0) G, 
    max(x.H - y.H, 0) H, max(x.I - y.I, 0) I
    from file x, file y
    where x.rowid > y.rowid")

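(Of course, if it's really this large then you might have trouble doing any subsequent calculations on it too.)

Output. At any rate, both alternatives give the same result shown below. x and y show which input rows were subtracted. Because the where clause is x.rowid > y.rowid, each line is row x minus row y with x > y (for example row2 - row1); use x.rowid < y.rowid instead to get the row1 - row2 ordering asked for in the question.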

> DF
  x y B C D E F G H I
1 2 1 0 0 0 0 0 1 0 1
2 3 1 0 0 0 0 1 0 1 0
3 3 2 1 0 1 0 1 0 1 0
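With many columns, the select list in the queries above could be generated rather than typed out. A sketch using only base R string functions; the helper name make_pair_sql is hypothetical, and it assumes the column names are valid unquoted SQL identifiers:

```r
# Hypothetical helper: build the pairwise query for the given column names
make_pair_sql <- function(cols, table = "df1") {
    # One "max(x.COL - y.COL, 0) COL" term per column
    terms <- sprintf("max(x.%s - y.%s, 0) %s", cols, cols, cols)
    paste0("select x.rowid x, y.rowid y, ",
           paste(terms, collapse = ", "),
           " from ", table, " x, ", table, " y",
           " where x.rowid > y.rowid")
}

sql <- make_pair_sql(c("B", "C"))
# The string could then be passed on, e.g. sqldf(sql, dbname = tempfile()),
# or built with table = "file" for use with read.csv.sql.
```

For the example data, make_pair_sql(names(df1)) should give a query equivalent to the one written by hand in Alternative 1.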

Note. Although the question asked for optimizing memory rather than speed, if speed were an issue, one could add indexes.

Upvotes: 3

Gavin Simpson

Reputation: 174788

The fastest way I know to do step 2 is to use indices into df1 for the various pairwise comparisons you want to do. The combn() function can be used to generate the set of row-by-row comparisons required. (Using this will be the rate limiting step for big data sets.)

The combinations of row-by-row operations we want to form are:

> cmb <- combn(as.numeric(rownames(df1)), 2)
> cmb
     [,1] [,2] [,3]
[1,]    1    1    2
[2,]    2    3    3

The two rows of cmb hold the two sets of indices into df1 needed to form the three rows of your requested output. (The 3 columns correspond to the 3 rows in your expected result.)

The next step is to use the two rows of cmb to index df1 and use a standard vectorised operation in R via -, e.g.:

> (out <- df1[cmb[1,], ] - df1[cmb[2,], ])
     B C  D E  F  G  H  I
1    1 0  1 0  0 -1  0 -1
1.1  0 1  0 1 -1  0 -1  0
2   -1 1 -1 1 -1  1 -1  1

Step 3 can now be done, although I am assuming that there can only be 1, 0, and -1 values in the resulting output:

> out[out < 0] <- 0
> out
    B C D E F G H I
1   1 0 1 0 0 0 0 0
1.1 0 1 0 1 0 0 0 0
2   0 1 0 1 0 1 0 1

Which is consistent with the output you requested.

For big operations, doing this with matrices might be faster overall. So we could do:

> mat <- data.matrix(df1)
> cmb <- combn(seq_len(NROW(mat)), 2)
> cmb
     [,1] [,2] [,3]
[1,]    1    1    2
[2,]    2    3    3
> out2 <- mat[cmb[1,], ] - mat[cmb[2,], ]
> out2[out2 < 0] <- 0
> out2
     B C D E F G H I
[1,] 1 0 1 0 0 0 0 0
[2,] 0 1 0 1 0 0 0 0
[3,] 0 1 0 1 0 1 0 1
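Since combn() was flagged above as the likely rate-limiting step, the same index pairs can also be built with rep() and sequence(). This is a swapped-in alternative rather than part of the approach shown above, and pair_index is a made-up name:

```r
# Hypothetical helper: all index pairs (i, j) with i < j, in combn() order
pair_index <- function(N) {
    stopifnot(N >= 2)
    k <- (N - 1):1                           # partners available to each first index
    first <- rep(seq_len(N - 1), times = k)  # 1,1,...,2,2,...,N-1
    second <- first + sequence(k)            # i+1, i+2, ..., N for each i
    rbind(first, second)
}

cmb2 <- pair_index(3)
# It would be used exactly like cmb: mat[cmb2[1, ], ] - mat[cmb2[2, ], ]
```

The index matrix still has choose(N, 2) columns, so this only addresses the time spent in combn(), not the size of the result.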

If you need the rownames as you show, then you can easily generate these at the end:

> apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
[1] "row1-row2" "row1-row3" "row2-row3"

which can be used as:

> rownames(out) <- apply(cmb, 2, function(x) paste("row", x[1], "-row", x[2], sep = ""))
> out
          B C D E F G H I
row1-row2 1 0 1 0 0 0 0 0
row1-row3 0 1 0 1 0 0 0 0
row2-row3 0 1 0 1 0 1 0 1

Upvotes: 4
