Reputation: 21
My R application reads input data from large txt files; it does not read the entire file in one shot. Users specify gene names (3 or 4 at a time), and based on that input the app goes to the appropriate rows and reads the data.
File format: 32,000 rows (one gene per row; the first two columns contain info such as the gene name) and 35,000 columns of numerical data (decimal numbers).
I use read.table(filename, skip = 10000), etc., to go to the right row and then read its 35,000 columns of data. Then I do this again for the 2nd gene, 3rd gene (up to 4 genes max), and process the numerical results.
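Roughly, each lookup is something like this sketch (the file name, separator, and skip value are placeholders):

    gene_row <- 10000                        # row of the requested gene, known in advance
    one_gene <- read.table("genes.txt", sep = "\t",
                           skip = gene_row - 1,  # skip the rows before the gene
                           nrows = 1)            # read only that single row (an assumption about the actual call)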
The file-reading operations take about 1.5 to 2.0 minutes. I am experimenting with reading the entire file up front and then extracting the data for the desired genes.
Is there any way to accelerate this? I can rewrite the gene data in another format (one-time processing) if that will speed up future reads.
Upvotes: 2
Views: 261
Reputation: 263332
This would be more efficient if you used a database interface. There are several available via the RODBC package, but a particularly well-integrated-with-R option would be the sqldf package, which by default uses SQLite. You would then be able to use the indexing capacity of the database to look up the correct rows and read all the columns in one operation.
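As a rough sketch of the one-time import and later lookups, using RSQLite directly (sqldf sits on top of it by default); the file name, database name, and gene names below are placeholders, and the file is assumed to be tab-delimited with no header:

    library(DBI)
    library(RSQLite)

    # One-time conversion: load the flat file and store it in an indexed SQLite table.
    # (For a file this size you may prefer to import it in chunks rather than one read.table call.)
    con   <- dbConnect(SQLite(), dbname = "genes.sqlite")
    genes <- read.table("genes.txt", sep = "\t", header = FALSE, stringsAsFactors = FALSE)
    dbWriteTable(con, "genes", genes, overwrite = TRUE)
    dbExecute(con, "CREATE INDEX idx_gene ON genes (V1)")   # V1 = gene-name column

    # Later sessions: fetch only the requested genes in a single indexed query.
    wanted <- c("geneA", "geneB", "geneC")                  # placeholder gene names
    sql    <- sprintf("SELECT * FROM genes WHERE V1 IN (%s)",
                      paste(sprintf("'%s'", wanted), collapse = ", "))
    result <- dbGetQuery(con, sql)
    dbDisconnect(con)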
Upvotes: 2
Reputation: 57686
You can use the colClasses argument to read.table to speed things up, if you know the exact format of your files. For 2 character columns and 34,998 (?) numeric columns, you would use
colClasses = c(rep("character",2), rep("numeric",34998))
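Combined with the skip-based lookup from the question, and adding nrows = 1 so only the one row is parsed, the call might look like this sketch (the file name, separator, and skip value are placeholders):

    one_gene <- read.table("genes.txt", sep = "\t", header = FALSE,
                           skip = 10000, nrows = 1,   # jump to and read a single gene's row
                           colClasses = c(rep("character", 2), rep("numeric", 34998)))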
Upvotes: 2