Reputation: 322

Best way to store 1 trillion lines of information

I'm doing calculations, and the resulting text file currently has 288012413 lines with 4 columns. Sample line:

288012413; 4855 18668 5.5677643628300215

The file is nearly 12 GB.

That's just unreasonable. It's plain text. Is there a more efficient way? I only need about 3 decimal places, but would a limiter save much room?

Upvotes: 1

Answers (8)

Eugene Mayevski 'Callback

Reputation: 46095

The most obvious answer is just "split the data". Put it into separate files, e.g. 1 million lines per file. NTFS is quite good at handling hundreds of thousands of files per folder.

Then you've got a number of answers regarding reducing data size.

Next, why keep the data as text if you have a fixed-sized structure? Store the numbers as binaries - this will reduce the space even more (text format is very redundant).

Finally, a DBMS can be your best friend. A NoSQL DBMS should work well, though I am not an expert in this area and I don't know which one can hold a trillion records.

If I were you, I would go with a fixed-size binary format, where each record occupies a fixed number of bytes (16-20?). Then even if I keep the data in one file, I can easily compute the position at which I need to start reading. If you need to do lookups (say by column 1) and the data is not re-generated all the time, then you could do a one-time sort by the lookup key after generation -- this would be slow, but as a one-time procedure it would be acceptable.
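A minimal sketch of the fixed-size idea, assuming a hypothetical 16-byte layout. The names record_t, record_offset and read_record are made up for illustration, and the line number is implied by the record's position instead of being stored:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 16-byte record; the field names are illustrative only. */
typedef struct {
    int32_t col2;   /* second column, e.g. 4855 */
    int32_t col3;   /* third column, e.g. 18668 */
    double  value;  /* fourth column */
} record_t;

/* Byte offset of the record that belongs to 1-based line number `line`. */
long long record_offset(long long line)
{
    assert(line >= 1);
    return (line - 1) * (long long)sizeof(record_t);
}

/* Read the record for `line` from an open binary file.
   Returns 1 on success, 0 on failure.
   Note: fseek takes a long; where long is 32 bits, a file this large
   would need fseeko or _fseeki64 instead. */
int read_record(FILE *f, long long line, record_t *out)
{
    if (fseek(f, (long)record_offset(line), SEEK_SET) != 0)
        return 0;
    return fread(out, sizeof *out, 1, f) == 1;
}
```

With this layout, finding line N is a single multiplication and one seek, with no scanning of the preceding records.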

Upvotes: 0

Tim Williscroft

Reputation: 3756

Like AShelly, but smaller.

Assuming line #'s are continuous...

struct x {
    short thing1;
    short thing2;
    short value;  // you said only 3 dp, so store as fixed point n*1000; you get 2 digits left of the dp
};

save in binary file.

lseek() read() and write() are your friends.

file will be large(ish) at around 1.7 GB (288012413 records of 6 bytes each).
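A rough sketch of how that could look with the POSIX calls, assuming fixed-width int16_t fields in place of short. The helper names (to_fixed, from_fixed, open_records, put_record, get_record) are hypothetical, and values outside roughly +/-32.767 do not fit this encoding:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* 6-byte record: same shape as the struct above, with fixed-width types. */
struct x {
    int16_t thing1;
    int16_t thing2;
    int16_t value;   /* fixed point: real value * 1000 */
};

/* Convert a real value to thousandths, rounding to nearest. */
int16_t to_fixed(double v)
{
    long scaled = (long)(v * 1000.0 + (v >= 0 ? 0.5 : -0.5));
    assert(scaled >= INT16_MIN && scaled <= INT16_MAX);
    return (int16_t)scaled;
}

/* Convert thousandths back to a real value. */
double from_fixed(int16_t f)
{
    return f / 1000.0;
}

/* Open (or create) the record file for reading and writing. */
int open_records(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0644);
}

/* Write the record at 0-based position `index`. Returns 0 on success. */
int put_record(int fd, off_t index, const struct x *r)
{
    if (lseek(fd, index * (off_t)sizeof *r, SEEK_SET) == (off_t)-1)
        return -1;
    return write(fd, r, sizeof *r) == (ssize_t)sizeof *r ? 0 : -1;
}

/* Read the record at 0-based position `index`. Returns 0 on success. */
int get_record(int fd, off_t index, struct x *r)
{
    if (lseek(fd, index * (off_t)sizeof *r, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, r, sizeof *r) == (ssize_t)sizeof *r ? 0 : -1;
}
```

The sample value 5.5677643628300215 would be stored as 5568 and read back as 5.568.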

Upvotes: 0

AShelly

Reputation: 35600

If you are going to use the result as a lookup table, why use ASCII for numeric data? Why not define a struct like so:

struct x {
   long lineno;
   short thing1;
   short thing2;
   double value;
};

and write the struct to a binary file? Since all the records are of a known size, advancing through them later is easy.
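As an illustration only, here is one way writing and later advancing through such records might look with stdio. It assumes fixed-width types in place of long and short so the size is predictable (typically 24 bytes per record on common 64-bit platforms, including padding); write_records and scan_records are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Fixed-width variant of the struct above (an assumption, not the original). */
struct x {
    int64_t lineno;
    int16_t thing1;
    int16_t thing2;
    double  value;
};

/* Write n records at the current file position; returns the number written. */
size_t write_records(FILE *f, const struct x *recs, size_t n)
{
    return fwrite(recs, sizeof recs[0], n, f);
}

/* Walk the whole file from the start in chunks of 1024 records.
   Returns the record count and stores the largest value in *max_value
   (left untouched if the file is empty). */
size_t scan_records(FILE *f, double *max_value)
{
    struct x buf[1024];
    size_t total = 0, got;

    assert(max_value != NULL);
    rewind(f);
    while ((got = fread(buf, sizeof buf[0], 1024, f)) > 0) {
        for (size_t i = 0; i < got; i++)
            if (total + i == 0 || buf[i].value > *max_value)
                *max_value = buf[i].value;
        total += got;
    }
    return total;
}
```

The raw struct bytes are not portable across compilers or architectures, so the file should be read back by a program built the same way as the one that wrote it.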

Upvotes: 1

Luke101

Reputation: 65308

Go ahead and use a MySQL database.

MSSQL Express has a database size limit of 4 GB (10 GB from 2008 R2 onward)
MS Access has a limit of 2 GB

So these options are out. I think using a simple database like MySQL or SQLite without indexing will be your best bet. Accessing the data through a database will probably be faster anyway, and on top of that the file size may be smaller.

Upvotes: 2

Nate Koppenhaver

Reputation: 1702

Well, if the files are that big and you are doing calculations that require any sort of precision, you are not going to want a limiter. It might do more harm than good, and with a 12-15 GB file, problems like that will be really hard to debug. I would use a compression utility such as gzip, ZIP, BlakHole or 7-Zip to compress it.

Also, what encoding are you using? If you are just storing numbers, all you need is ASCII. A wide Unicode encoding such as UTF-16 or UTF-32 will double or quadruple the size of the file vs. ASCII (UTF-8 takes the same space as ASCII for digits).

Upvotes: 0

Justin

Reputation: 86789

Well,

The first column looks suspiciously like a line number - if this is the case then you can probably just get rid of it saving around 11 characters per line.
If you only need about 3 decimal places then you can round / truncate the last column, potentially saving another 12 characters per line.

I.e. you can get rid of 23 characters per line. That line is 40 characters long, so you can approximately halve your file size.
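To make the two trimming steps concrete, here is a small sketch that assumes every line has exactly the layout of the sample ("lineno; int int float"). The function name shrink_line is made up for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Drop the leading "lineno;" field and round the last column to 3 decimals.
   Writes the shortened line to `out` and returns its length,
   or -1 if the input does not parse or `out` is too small. */
int shrink_line(const char *in, char *out, size_t outsz)
{
    int a, b, n;
    double v;
    const char *p;

    assert(in != NULL && out != NULL);
    p = strchr(in, ';');                 /* end of the line-number field */
    if (p == NULL || sscanf(p + 1, "%d %d %lf", &a, &b, &v) != 3)
        return -1;
    n = snprintf(out, outsz, "%d %d %.3f", a, b, v);
    return (n < 0 || (size_t)n >= outsz) ? -1 : n;
}
```

Applied to the sample line, this turns the 40-character "288012413; 4855 18668 5.5677643628300215" into the 16-character "4855 18668 5.568".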

If you do round the last column then you should be aware of the effect that rounding errors may have on your calculations - if the end result needs to be accurate to 3 dp then you might want to keep a couple of extra digits of precision depending on the type of calculation.

You might also want to look into compressing the file if it is just used for storing the results.

Upvotes: 1