Anthony Mastrean

Reputation: 22414

How do I quickly fill up a multi-petabyte NAS?

My company's product will produce petabytes of data each year at our client sites. I want to fill up a multi-petabyte NAS to simulate a system that has been running for a long time (3 months, 6 months, a year, etc). We want to analyze our software while it's running on a storage system under load.

I could write a script that creates this data (a single script could take weeks or months to execute). Are there recommendations on how to farm out the script (multiple machines, multiple threads)? The NAS has 3 load balanced incoming links... should I run directly on the NAS device?

Are there third-party products that I could use to create load? I don't even know how to start searching for products like this.

Does it matter if the data is realistic? Does anyone know anything about NAS/storage architecture? Can it just be random bits, or does the regularity of the data matter? We fan the data out on disk in this format:

x:\<year>\<day-of-year>\<hour>\<minute>\<guid-file-name>.ext

Upvotes: 5

Views: 376

Answers (1)

Malcolm Box

Reputation: 4036

You are going to be limited by the write speed of the NAS/disks - I can think of no way of getting round that.

So the challenge then is simply to write-saturate the disks for as long as needed. A script or set of scripts running on a reasonable machine should be able to do that without difficulty.

To get started, use something like Bonnie++ to find out how fast your disks can write. Then you could use the code from Bonnie as a starting point to saturate the writes - after all, to benchmark a disk, Bonnie has to be able to generate writes faster than the NAS can absorb them.

Assuming you have 3x 1 Gbit/s Ethernet connections, the max network input to the box is about 300 MB/s. A single PC is capable of saturating a 1 Gbit/s Ethernet link, so 3 PCs should work. Get each PC to write its own section of the tree and voila.
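A minimal sketch of what each PC's writer might look like, following the questioner's `<year>\<day-of-year>\<hour>\<minute>\<guid>.ext` layout. The file size, files-per-directory count, and `.ext` extension are placeholders; the point is that each machine gets a disjoint range of days, so the writers never touch the same directories:

```python
import os
import uuid

def fill_section(root, year, days, hours=range(24), minutes=range(60),
                 files_per_dir=4, file_size=1024 * 1024):
    """Fill one section of the <year>/<day-of-year>/<hour>/<minute>/
    tree with random-content files. Give each PC (or process) its
    own `days` range so the writers don't contend."""
    for day in days:
        for hour in hours:
            for minute in minutes:
                d = os.path.join(root, str(year), "%03d" % day,
                                 "%02d" % hour, "%02d" % minute)
                os.makedirs(d, exist_ok=True)
                for _ in range(files_per_dir):
                    name = os.path.join(d, str(uuid.uuid4()) + ".ext")
                    with open(name, "wb") as f:
                        f.write(os.urandom(file_size))
```

In practice you'd want larger files and buffered sequential writes to keep the link saturated; `os.urandom` is convenient here but CPU-bound at these rates, so pre-generating a random buffer and reusing it is a reasonable shortcut if the data only needs to be incompressible.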

Of course, to fill a petabyte at 300 MB/s will take about a month.
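The back-of-the-envelope arithmetic behind that month figure, using decimal units:

```python
PETABYTE = 10 ** 15       # bytes (decimal PB)
rate = 300 * 10 ** 6      # bytes/second: three saturated GigE links
seconds = PETABYTE / rate
days = seconds / 86400    # ~38.6 days, i.e. a bit over a month
```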

Alternatively, you could lie to your code about the state of the NAS. On Linux, you could write a user-space filesystem that pretended to have several petabytes of data by creating on-the-fly metadata (filename, length, etc.) for a petabyte's worth of files. When the product reads, generate random data. When your product writes, write it to real disk and remember that you've got "real" data if it's read again.

Since your product presumably won't read the whole petabyte during this test, nor write much of it, you could easily simulate an arbitrarily full NAS instantly.
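The core of that trick, sketched independently of any particular user-space filesystem binding (FUSE or otherwise) - the method names and the fabricated-size formula here are illustrative, not a real filesystem API. Metadata for untouched paths is derived deterministically from the path, so repeated lookups agree; real writes are stored and shadow the fake view:

```python
import hashlib

class FakeNAS:
    """Pretend-full NAS: fabricate metadata and read data on the fly,
    but keep anything actually written so later reads see it."""

    def __init__(self):
        self.real = {}  # path -> bytes actually written

    def getattr(self, path):
        if path in self.real:
            return {"st_size": len(self.real[path])}
        # Deterministic pseudo-random size, stable across calls
        h = int(hashlib.md5(path.encode()).hexdigest(), 16)
        return {"st_size": 1024 + h % (10 * 1024 * 1024)}

    def read(self, path, size, offset):
        if path in self.real:
            return self.real[path][offset:offset + size]
        # Fabricate content on the fly; zeros are enough if the
        # product doesn't care what old files contain
        return b"\0" * size

    def write(self, path, data):
        self.real[path] = data
```

Wiring this into a real mount (e.g. via a FUSE binding) adds directory listing and the rest of the stat fields, but the invariant is the same: fake until written, real afterwards.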

Whether this takes more or less than a month to develop is an open question :)

Upvotes: 3
