Eric

Reputation: 562

File structure to avoid data corruption

I'm currently developing an upgrade of our current media storage (which stores video/audio/metadata) for a surveillance system, and I'm redesigning the recording structure into a more robust solution.

I need to create some index data for the data stored in the data files, so I'm creating an index file structure, but I'm concerned about hard disk failure (imagine the power is cut during a write of the index file: it will become corrupt, since the data will most likely be half written). I have already designed how the index will be stored, but my concern is data corruption on power failure or disk failure.

So, does anyone know techniques to avoid data corruption while writing?

I already searched a little and found no good solutions. One suggestion was to keep a log of everything written to the file, but then I would have many more I/Os per second (I'm concerned about the number of I/Os per second as well; the system should use as few as possible).

What I came up with was to duplicate the sensitive data in the index file, along with timestamp and checksum fields. For example:

Field1 Field2 Field3 Timestamp Checksum

Field1 Field2 Field3 Timestamp Checksum

So I have the data written twice. If, when I read the file, the first set of fields is corrupted (the checksum doesn't match), I have the second set of fields, which should be OK. I believe corruption happens when the write is stopped in the middle, so, for example, if the power fails while the software is writing the first set of fields, the second set is still intact; if the power fails while the second set is being written, the first one is already intact.
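Here is a minimal sketch of what I have in mind, assuming fixed-size records and a CRC-32 helper; the struct layout and the function names are just illustrative, not my actual format (and a real implementation would also need FlushFileBuffers/fsync, since fflush only reaches the OS cache):

    // Sketch: each logical index entry is stored as two consecutive copies,
    // each carrying a timestamp and a checksum over its own payload. The
    // reader prefers the first copy whose checksum verifies.
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <ctime>

    #pragma pack(push, 1)
    struct IndexRecord {
        uint64_t field1;
        uint64_t field2;
        uint64_t field3;
        uint64_t timestamp;   // seconds since epoch when the record was written
        uint32_t checksum;    // CRC-32 over all preceding bytes of this copy
    };
    #pragma pack(pop)

    // Plain bitwise CRC-32 (reflected, polynomial 0xEDB88320).
    static uint32_t crc32(const void* data, size_t len) {
        const uint8_t* p = static_cast<const uint8_t*>(data);
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; ++i) {
            crc ^= p[i];
            for (int b = 0; b < 8; ++b)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    // Write the same logical entry twice, flushing between the copies so
    // that a power cut can damage at most one of them.
    bool write_entry(FILE* f, uint64_t f1, uint64_t f2, uint64_t f3) {
        IndexRecord rec{f1, f2, f3, static_cast<uint64_t>(time(nullptr)), 0};
        rec.checksum = crc32(&rec, offsetof(IndexRecord, checksum));
        for (int copy = 0; copy < 2; ++copy) {
            if (fwrite(&rec, sizeof rec, 1, f) != 1) return false;
            if (fflush(f) != 0) return false;   // push this copy to the OS first
        }
        return true;
    }

    // Read one logical entry (two stored copies); return whichever verifies.
    bool read_entry(FILE* f, IndexRecord* out) {
        IndexRecord copies[2];
        if (fread(copies, sizeof(IndexRecord), 2, f) != 2) return false;
        for (const IndexRecord& c : copies) {
            if (crc32(&c, offsetof(IndexRecord, checksum)) == c.checksum) {
                *out = c;
                return true;
            }
        }
        return false;   // both copies damaged
    }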

What do you guys think of this solution? Does it avoid data corruption?

BTW, I can't use any kind of database for this kind of storage, or Transactional NTFS, due to the restrictions on deploying a system that relies on Transactional NTFS.

Any ideas are welcome, thanks!

Upvotes: 4

Views: 1720

Answers (3)

CesarC

Reputation: 88

It does not avoid data corruption, since corruption can happen in either one or both sets of fields.

I think you are better off not duplicating the "sensitive data" and instead writing that data in two steps: in the first step, write the data with the "checksum" field empty, and in a second step, update the checksum with the value that matches the data. This checksum serves as a "transaction committed" flag and also ensures data integrity.

When you read the data, ignore all index entries that are not committed, i.e. those where the checksum doesn't match.
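A minimal sketch of this two-step write, under assumptions similar to the sketch in the question (fixed-size records, an illustrative checksum routine; all names here are just for illustration):

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    #pragma pack(push, 1)
    struct Entry {
        uint64_t field1, field2, field3, timestamp;
        uint32_t checksum;   // 0 means "not committed yet" in this sketch
    };
    #pragma pack(pop)

    // Illustrative checksum; a real index should use CRC-32 or stronger, and
    // must never emit the reserved "uncommitted" value 0.
    static uint32_t checksum_of(const Entry& e) {
        const uint8_t* p = reinterpret_cast<const uint8_t*>(&e);
        uint32_t sum = 0;
        for (size_t i = 0; i < offsetof(Entry, checksum); ++i)
            sum = sum * 31u + p[i];
        return sum == 0 ? 1 : sum;
    }

    bool append_committed(FILE* f, Entry rec) {
        rec.checksum = 0;                                // step 1: uncommitted
        long pos = ftell(f);
        if (pos < 0) return false;
        if (fwrite(&rec, sizeof rec, 1, f) != 1) return false;
        if (fflush(f) != 0) return false;                // data before the flag

        uint32_t crc = checksum_of(rec);                 // step 2: commit flag
        if (fseek(f, pos + (long)offsetof(Entry, checksum), SEEK_SET) != 0) return false;
        if (fwrite(&crc, sizeof crc, 1, f) != 1) return false;
        if (fflush(f) != 0) return false;
        return fseek(f, 0, SEEK_END) == 0;               // back to append position
    }

    // The reader only trusts entries whose checksum is present and matches.
    bool is_committed(const Entry& rec) {
        return rec.checksum != 0 && rec.checksum == checksum_of(rec);
    }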

Then do a lot of testing and fine tuning: force data corruption at every step of the process, and also save random data. I personally think the testing will need a lot of work, since failure is random; that's why people recommend using databases that have been tested for years.

Note that while this adds some protection against some kinds of data corruption, it is not perfect, and you may want to add other layers of protection for your data, including data replication, integrity checks and external measures such as UPS units (no-breaks), RAID systems and periodic backups.

There is a lot of theory around "transactions".

Search for "atomic transaction algorithms" to get more detail.

Reconsider using a database, reconsider using a log, and even reconsider using the file system to store your info.

Upvotes: 2

perreal

Reputation: 98088

You can use some sort of transaction logic. Create the index in small chunks, writing each chunk to a temporary file first. When you finish one chunk (file), check its integrity and copy it over as the actual index file if it passes the test. At this point you can also distribute a few copies of the verified chunk.
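A minimal sketch of this pattern, assuming C++17 <filesystem>; the file names and the verify_chunk() placeholder are illustrative, and it uses a rename instead of a plain copy because a same-volume rename is usually the atomic way to swap the new chunk in:

    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <system_error>
    #include <vector>

    namespace fs = std::filesystem;

    // Placeholder integrity check: recompute whatever checksum your chunk
    // format carries. Here it only verifies the file exists and is non-empty.
    static bool verify_chunk(const fs::path& p) {
        std::error_code ec;
        auto size = fs::file_size(p, ec);
        return !ec && size > 0;
    }

    bool publish_index_chunk(const std::vector<uint8_t>& chunk,
                             const fs::path& final_path) {
        fs::path tmp = final_path;
        tmp += ".tmp";                                   // e.g. index.idx.tmp
        std::error_code ec;

        {   // 1. Write the whole chunk to the temporary file.
            std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
            out.write(reinterpret_cast<const char*>(chunk.data()),
                      static_cast<std::streamsize>(chunk.size()));
            out.flush();
            if (!out) { fs::remove(tmp, ec); return false; }
        }   // file closed here

        // 2. Check integrity of what actually reached the file system.
        if (!verify_chunk(tmp)) { fs::remove(tmp, ec); return false; }

        // 3. Move the verified chunk over the real index name. On most
        //    platforms a same-volume rename replaces the target in a single
        //    step, so readers never see a half-written chunk.
        fs::rename(tmp, final_path, ec);
        if (ec) { fs::remove(tmp, ec); return false; }

        // 4. This is the point where extra copies of the verified chunk can
        //    be distributed (backup directories, a second disk, etc.).
        return true;
    }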

Upvotes: 0

user496736

Reputation:

Ignoring the part of your question around not being able to use a database :)

You might find SQL Server 2012's FileTables of interest. You can store the files outside of the database in a folder but still access them as if they were inside the database. You can use the database to insert new files into that directory, or simply copy a file into the folder. Your database won't get really fat with the video files, nor will they be inaccessible if the DB server software goes down. Your frame index could be individual .jpg files (or whatever), and those, too, could be referenced by a FileTable and linked, via a foreign key, to the main video file. The frame index table is then very straightforward.

So you eliminate the DB overhead of writing the file and maintaining a log to check for failures. If the OS can't write the file because of a power failure, the database won't stand a chance either. You can do directory comparisons and use a robust utility to move the files around, one that does not remove the source file if any part of the write fails.

Upvotes: 2
