AI52487963

Reputation: 179

Writing a table to multiple files in R

Seemingly simple question, but I don't know R's loop syntax and variable assignment very well. I have a 6900-line table that I want split into 10 equally sized text files. My code is below, but how would I design a loop around it and iterate through the filenames?

write.table(clipboard[1:619,1],
            "mydata1.txt", sep="\t")
write.table(clipboard[619:1238,1],
            "mydata2.txt", sep="\t")
write.table(clipboard[1238:1857,1],
            "mydata3.txt", sep="\t")
write.table(clipboard[1857:2476,1],
            "mydata4.txt", sep="\t")
write.table(clipboard[2476:3095,1],
            "mydata5.txt", sep="\t")
write.table(clipboard[3095:3714,1],
            "mydata6.txt", sep="\t")
write.table(clipboard[3714:4333,1],
            "mydata7.txt", sep="\t")
write.table(clipboard[4333:4952,1],
            "mydata8.txt", sep="\t")
write.table(clipboard[4952:5571,1],
            "mydata9.txt", sep="\t")
write.table(clipboard[5571:6190,1],
            "mydata10.txt", sep="\t")

Upvotes: 0

Views: 273

Answers (1)

PascalVKooten

Reputation: 21433

The manual way

A plain loop is perfectly fine for I/O like this:

for (i in 1:10) {
  start <- 1 + (i-1) * nrow(clipboard) / 10
  end <- i * nrow(clipboard) / 10
  fname <- paste("mydata", i ,".txt", sep="")
  write.table(x=clipboard[start:end, 1], file=fname, sep="\t")
}

Note that this assumes the table can actually be split into 10 equally sized files, i.e. that nrow(clipboard) is divisible by 10!
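If the row count is not a multiple of 10, a common idiom (a sketch, using a stand-in matrix in place of your clipboard data) maps row indices to groups with ceiling(), so group sizes differ by at most one row and no rows are dropped:

```r
# Stand-in for the asker's 6190-row table; replace with your own data
clipboard <- matrix(seq_len(6190), ncol = 1)
n <- nrow(clipboard)

# ceiling() assigns each row index to one of 10 groups; when n is not
# a multiple of 10 the group sizes differ by at most one row
groups <- split(seq_len(n), ceiling(seq_len(n) * 10 / n))

# Write one file per group (tempdir() keeps the example self-contained)
out <- file.path(tempdir(), paste0("mydata", seq_along(groups), ".txt"))
for (i in seq_along(groups)) {
  write.table(clipboard[groups[[i]], 1], file = out[i], sep = "\t")
}
```

In practice you would write "mydata1.txt" etc. into your working directory, as in the loop above.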

Done properly, write.split:

When the row count is not perfectly divisible, this method creates an extra file for the remainder.

I used this splitter to create a list of data that is then processed in parallel for some statistical computations in my package correlate. Here, it means we can also write the files in parallel. Note that this is pointless for small files, and may even be slower.

# Helper to split the data in chunks
splitter <- function(x, splitsize) {
  nr <- nrow(x)
  if (splitsize > nr) {
    splitsize <- nr
  }
  splits <- floor(nr / splitsize)
  # rep(..., each = splitsize) keeps each chunk's rows contiguous;
  # split() flattens the matrix, so matrix(x, splitsize) restores the shape
  splitted.list <- lapply(split(x[seq_len(splits * splitsize), ],
                          rep(seq_len(splits), each = splitsize)),
                          function(x) matrix(x, splitsize))
  if (nr %% splitsize != 0) {
    splitted.list$last <- x[(splits * splitsize + 1):nr, ]
  }
  return(splitted.list)
}

write.split <- function(x, chunks, file.prefix, file.extension, cores = 1, ...) {
  splitsize <- floor(nrow(x) / chunks)
  splitted.list <- splitter(x, splitsize)
  writer <- function(z) {
    write.table(splitted.list[[z]],
                file = paste(file.prefix, z, file.extension, sep = ""),
                ...)
  }
  if (cores == 1) {
    sapply(names(splitted.list), writer)
  } else {
    # currently just the simple Linux version; this won't work on Windows.
    # Upon request I'll add it
    stopifnot(require(parallel))
    mclapply(names(splitted.list), writer, mc.cores = cores)
  }
}

Usage:

write.split(z, chunks = 10,
            file.prefix = "mydata", file.extension = ".txt", sep="\t")

You can also give it row.names and col.names arguments — basically anything that can be passed to write.table.

Benchmark:

Using `matrix(1:1000000, 1000)` as data.
Unit: seconds
   expr      min       lq   median       uq      max neval
 1-core 1.780022 1.990751 2.079907 2.166891 2.744904   100
4-cores 1.305048 1.438777 1.492114 1.559110 2.070911   100

Extensibility: it could also easily be extended to accept a number of lines to write per file rather than a number of chunks.
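As a sketch of that extension (write.chunks is a hypothetical name, not part of the answer's code), the same split()/ceiling() idiom keyed on rows per file could look like:

```r
# Hypothetical variant: fix the number of rows per file instead of the
# number of files; the final file simply holds whatever remainder is left
write.chunks <- function(x, lines.per.file, file.prefix, file.extension, ...) {
  n <- nrow(x)
  # each run of lines.per.file consecutive row indices forms one group
  groups <- split(seq_len(n), ceiling(seq_len(n) / lines.per.file))
  for (i in seq_along(groups)) {
    write.table(x[groups[[i]], , drop = FALSE],
                file = paste0(file.prefix, i, file.extension), ...)
  }
  invisible(length(groups))  # number of files written
}
```

Usage, mirroring the question's chunk size:

write.chunks(z, lines.per.file = 619,
             file.prefix = "mydata", file.extension = ".txt", sep = "\t")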

Upvotes: 2
