MHtaylor

Reputation: 29

Read.table and dbWriteTable result in different output?

I'm working with 12 large data files, all of which hover between 3 and 5 GB, so I was turning to RSQLite for import and initial selection. Giving a reproducible example in this case is difficult, so if you can come up with anything, that would be great.

If I take a small set of the data, read it in, and write it to a table, I get exactly what I want:

library(RSQLite)
con <- dbConnect("SQLite", dbname = "R2")
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow = 100, header = TRUE)
dbWriteTable(con, name = "Chr1test", value = data)

> dbListFields(con, "Chr1test")
[1] "row_names" "CHR_A"     "BP_A"      "SNP_A"     "CHR_B"     "BP_B"      "SNP_B"     "R2"     

> dbGetQuery(con, "SELECT * FROM Chr1test LIMIT 2")
  row_names CHR_A BP_A      SNP_A CHR_B BP_B       SNP_B         R2
1         1     1 1579 SNP-1.578.     1 2097 SNP-1.1096. 0.07223050
2         2     1 1579 SNP-1.578.     1 2553 SNP-1.1552. 0.00763724

If I instead pass the whole file straight to dbWriteTable, though, the columns aren't separated correctly. I've tried both sep = " " and sep = "\t", but both give the same mangled column separation:

dbWriteTable(con, name = "Chr1", value = "chr1.ld", header = TRUE)
> dbListFields(con, "Chr1")
[1] "CHR_A_________BP_A______________SNP_A__CHR_B_________BP_B______________SNP_B___________R

I can tell that it's clearly some sort of delimiter issue, but I've exhausted my ideas on how to fix it. Has anyone run into this before?
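For what it's worth, the runs of underscores in that merged field name correspond to runs of spaces in the header line, so the file looks space-padded rather than cleanly delimited by a single character. Dumping the first few raw lines shows what the separator actually is (a quick check against the same chr1.ld):

cat(readLines("chr1.ld", n = 3), sep = "\n")

read.table's default sep = "" swallows any amount of whitespace, which is presumably why the small read.table test above comes out fine, while a single-space or tab sep may not line up with the padding.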

*Edit, update: It seems as though this works:

n <- 1000000
f <- file("chr1.ld")
open(f)
data <- read.table(f, nrow = n, header = TRUE)
col_names <- names(data)  # keep the header so later chunks use the same names

con_data <- dbConnect("SQLite", dbname = "R2")
while (nrow(data) == n){
  dbWriteTable(con_data, name = "ch1", value = data, append = TRUE)
  # the header line was consumed by the first read, so later chunks need header = FALSE
  data <- read.table(f, nrow = n, header = FALSE, col.names = col_names)
}
close(f)
if (nrow(data) != 0){
  dbWriteTable(con_data, name = "ch1", value = data, append = TRUE)
}

Though I can't quite figure out why passing the file straight to dbWriteTable is a problem. Possibly a memory issue.
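As a rough sanity check that the chunked import is complete (same connection and table names as above), the row count in SQLite should equal the line count of chr1.ld minus the header line:

dbGetQuery(con_data, "SELECT COUNT(*) FROM ch1")  # compare with wc -l chr1.ld, minus 1 for the header
dbListFields(con_data, "ch1")                     # column names should match the file's header row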

Upvotes: 0

Views: 360

Answers (1)

Patrick

Reputation: 91

I am guessing that your big file is causing a free memory issue (see Memory Usage under docs for read.table). It would have been helpful to show us the first few lines of chr1.ld (on *nix systems you just say "head -n 5 chr1.ld" to get the first five lines). If it is a memory issue, then you might try sipping the file as a work-around rather than gulping it whole. Determine or estimate the number of lines in chr1.ld (on *nix systems, say "wc -l chr1.ld"). Let's say your file has 100,000 lines.
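If you would rather get that line count from inside R than from the shell, something like this rough sketch works (count_lines is just an ad-hoc helper, not from any package; it reads in chunks so it never holds the whole file in memory):

count_lines <- function(path, chunk = 100000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0
  repeat {
    got <- length(readLines(con, n = chunk))
    n <- n + got
    if (got < chunk) break
  }
  n
}
count_lines("chr1.ld")

With a line count in hand, the sipping loop below reads the file a piece at a time: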

sip.size <- 100
# read the header once so every chunk is parsed with the same column names
col_names <- names(read.table("chr1.ld", nrow = 1, header = TRUE))
for (i in seq(0, 100000, by = sip.size)) {
    # skip the header line plus the rows already read, then take one sip
    data <- read.table("chr1.ld", nrow = sip.size, skip = i + 1,
                       header = FALSE, col.names = col_names)
    dbWriteTable(con, name = "SippyCup", value = data, append = TRUE)
}

You'll probably see warnings at the end, but the data should make it through. If you have character data that read.table is trying to convert to factors, this kludge will be unsatisfactory unless there are only a few factor levels, all of which are guaranteed to occur in every chunk. You may need to tell read.table not to factor those columns (colClasses or stringsAsFactors = FALSE) or use some other method to find every possible level so you can list them for read.table. (On *nix, cut out that column and pipe it through sort -u.)
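For the file in the question, fixing the types up front would look roughly like this (the classes are guessed from the sample output shown above, so treat it as a sketch):

col_names   <- c("CHR_A", "BP_A", "SNP_A", "CHR_B", "BP_B", "SNP_B", "R2")
col_classes <- c("integer", "integer", "character",
                 "integer", "integer", "character", "numeric")
# colClasses stops read.table from guessing types or building factors,
# and col.names keeps every chunk consistent when header = FALSE
data <- read.table("chr1.ld", nrow = 100, skip = 1, header = FALSE,
                   col.names = col_names, colClasses = col_classes)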

Upvotes: 1
