Reputation: 313
I created a ~3TB binary file (located on an AWS EBS volume) intended to store an MxN matrix of doubles representing uniform financial time series across multiple days. There are M=37932 different time series, each of which has N=10415118 elements.
I have a C++ program that reads in financial market data for a specific date, opens M file pointers positioned at the appropriate starting offsets within the aforementioned binary file, and then writes the desired time series data at each pointer's location as it processes the market data.
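In simplified form (this is not the actual code, and the path is hypothetical), the write pattern looks roughly like this, assuming each series occupies one contiguous row of the matrix on disk:

// Simplified sketch of the write pattern (not the real program).
// Assumes row-major layout: series i occupies bytes [i*N*8, (i+1)*N*8).
#include <cstdio>
#include <cstdint>
#include <vector>

constexpr int64_t M = 37932;      // number of time series (rows)
constexpr int64_t N = 10415118;   // samples per series (columns)

int main() {
    const char* path = "/data/matrix.bin";   // hypothetical path
    std::vector<FILE*> fp(M);
    for (int64_t i = 0; i < M; ++i) {
        fp[i] = std::fopen(path, "r+b");     // one file pointer per series
        if (!fp[i]) return 1;
        // Position each pointer at the start of its series' row (plus the
        // column offset for the date being processed, omitted here).
        fseeko(fp[i], static_cast<off_t>(i) * N * sizeof(double), SEEK_SET);
    }
    // As market data is processed, samples are written at each pointer:
    double sample = 0.0;                      // placeholder value
    std::fwrite(&sample, sizeof(double), 1, fp[0]);
    for (FILE* f : fp) std::fclose(f);
    return 0;
}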
I am using a 72-core AWS EC2 instance running Ubuntu 16.04, and was running 54 instances of the above C++ program in parallel at a time (with several hundred dates to process overall). So in total, about 54*37932=2048328 file pointers were open on the system at once.
After some time, the processes began to get stuck in the uninterruptible sleep "D state" and just hung. Does anyone know why this could be? This issue tends to come up less often when I run fewer of the aforementioned processes in parallel.
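For reference, the stuck processes (and the kernel wait channel they are blocked in) can be listed with something like:
$ ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'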
I also noticed the following for the EBS volume; maybe it is contributing to the problem? I'm not sure whether fragmentation is meaningful for an EBS volume, or if/how it should be fixed.
$ sudo xfs_db -c frag -r /dev/nvme2n1
actual 1468060, ideal 16154, fragmentation factor 98.90%
(not sure if this would be more appropriate for ServerFault instead)
Upvotes: 0
Views: 1254
Reputation: 12668
A process enters the uninterruptible D state when it is blocked waiting on disk I/O: the disk driver is fetching data and the process cannot continue until the device responds. Normally a process stuck in the D state is a symptom of bad hardware (which should not happen on the platform you are using), but the worst part here is having three terabytes of data in a single file. That is not only unusual, it also means a single hardware failure loses everything, because all your eggs are in one basket. It would be better to look at your data and split this huge file up, probably into several directories of historical data. A set of text files describing your matrix would be a far more secure and reliable way to store the data, and you could probably get good compression as well if you think about the data structure a bit.
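To make the idea concrete, here is a rough sketch (paths and layout purely illustrative, not your actual data model) of splitting the matrix into one small file per date and series, so that each piece can be backed up, compressed, or regenerated independently:

// Rough sketch only: one file per (date, series) under a per-date directory,
// instead of a single 3TB file. Paths here are illustrative.
#include <sys/stat.h>
#include <fstream>
#include <string>
#include <vector>

void write_series(const std::string& date, int series_id,
                  const std::vector<double>& samples) {
    std::string dir = "/data/ts/" + date;     // e.g. /data/ts/2018-06-01 (assumes /data/ts exists)
    mkdir(dir.c_str(), 0755);                 // ignores EEXIST for brevity
    std::ofstream out(dir + "/" + std::to_string(series_id) + ".bin",
                      std::ios::binary);
    out.write(reinterpret_cast<const char*>(samples.data()),
              samples.size() * sizeof(double));
}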
I can't help much more, because you only describe the fabulous system you have ordered from Amazon to handle your matrix, not the data you have stored on it. Beyond that, all I can recommend is asking Amazon for a bigger machine, since the one you have now cannot handle your data as it stands. A better reorganization of the data would probably improve things considerably; as it is, you have described a fabulous system that is completely underused.
Upvotes: 2