Imagemagick: Optimizing the speed for identification of truncated images

Question

I am using ImageMagick to identify truncated images (images whose data ends prematurely) in a folder. The script I wrote identifies the images successfully; however, it is very slow. This is likely because it has to load each whole image into memory, but given the time it took me to copy my files to the disk, this should add no more than a few hours to the operation. I am analyzing over 700,000 images, and at the current speed the operation will take over a month to complete, not to mention the extremely high CPU usage.

foreach (string f in files)
{
    Tuple<int, string> result = ImageCorrupt(f);
    int exitCode = result.Item1;
    if (exitCode != 0)...
}

public static Tuple<int, string> ImageCorrupt(string pathToImage)
{
    var cmd = "magick identify -regard-warnings -verbose  \"" + pathToImage + "\"";

    var startInfo = new ProcessStartInfo
    {
        WindowStyle = ProcessWindowStyle.Hidden,
        FileName = "cmd.exe",
        Arguments = "/C " + cmd,
        UseShellExecute = false,
        RedirectStandardOutput = true,
        RedirectStandardError = true
    };

    var process = new Process
    {
        StartInfo = startInfo
    };

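    process.Start();
    // Start draining stderr asynchronously so that a full stderr pipe
    // cannot block the synchronous stdout read below.
    var stderrTask = process.StandardError.ReadToEndAsync();
    string output = process.StandardOutput.ReadToEnd();

    if (!process.WaitForExit(30000))
    {
        process.Kill();
        // Wait for the kill to take effect so ExitCode can be read.
        process.WaitForExit();
    }

    return Tuple.Create(process.ExitCode, stderrTask.Result);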
}

Here is an example of the problem I am trying to identify in the images.

Is there a way to optimize my script for performance? Or is there a faster way to identify the problem with the images?

jcupitt · Accepted Answer

You could try net-vips. It won't handle as many image formats as ImageMagick, but it will do the basic TIF/JPG/PNG/GIF etc., and it is quite a bit quicker.

I would test images by calculating the average pixel value. That way you are guaranteed to read every pixel, and the operation is cheap.

I don't actually have a C# install here, but in pyvips (the Python binding to the same library as net-vips), it would be:

import sys
import pyvips

for filename in sys.argv[1:]:
    try:
        # the fail option makes pyvips throw an exception on a file
        # format error
        # sequential access means libvips will stream the image rather than
        # loading it into memory
        image = pyvips.Image.new_from_file(filename,
                                           fail=True, access="sequential")
        avg = image.avg()
    except pyvips.Error as e:
        print("{}: {}".format(filename, e))

I can run it like this:

$ for i in {1..1000}; do cp ~/pics/k2.jpg $i.jpg; done
$ cp ~/pics/k2_broken.jpg .
$ vipsheader 1.jpg
1.jpg: 1450x2048 uchar, 3 bands, srgb, jpegload

That's one broken image, 1,000 OK images, all 1450x2048. Then:

$ time ../sanity.py *.jpg
k2_broken.jpg: unable to call avg
  VipsJpeg: Premature end of JPEG file
VipsJpeg: out of order read at line 48
real    0m23.424s

So on this modest laptop it found the broken image in 23s.

Your loop with identify (though only testing 100 images) would be:

$ time for i in {1..100}; do if ! identify -regard-warnings -verbose $i.jpg > /dev/null; then echo $i: error; fi; done
real        0m21.454s

That is about the same length of time for a tenth as many images, so libvips (the library behind net-vips) is around 10x faster on this test.
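
To put rough numbers on that comparison, here is the arithmetic as a small Python sketch. It only uses the two timings above and the 700,000 image count from the question. The extrapolation is an assumption: it supposes your images are about the same size as my test JPEG and that your machine is comparable to this laptop, and I know neither.

```python
# Wall-clock timings copied from the two runs above.
VIPS_SECONDS, VIPS_IMAGES = 23.424, 1001         # pyvips avg() over 1,001 files
IDENTIFY_SECONDS, IDENTIFY_IMAGES = 21.454, 100  # identify -verbose over 100 files


def per_image(total_seconds, image_count):
    """Average number of seconds spent on each image."""
    return total_seconds / image_count


vips_each = per_image(VIPS_SECONDS, VIPS_IMAGES)
identify_each = per_image(IDENTIFY_SECONDS, IDENTIFY_IMAGES)
speedup = identify_each / vips_each

# Hypothetical extrapolation to the 700,000 images in the question,
# assuming similar image sizes and similar hardware.
TOTAL_IMAGES = 700_000
vips_hours = TOTAL_IMAGES * vips_each / 3600
identify_hours = TOTAL_IMAGES * identify_each / 3600

print(f"pyvips:   {vips_each * 1000:.1f} ms per image, {vips_hours:.2f} hours in total")
print(f"identify: {identify_each * 1000:.1f} ms per image, {identify_hours:.2f} hours in total")
print(f"speedup:  {speedup:.2f}x")
```

The ratio comes out a little over 9, which is where the "around 10x" comes from.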

Because net-vips is relatively frugal with memory, you can also run quite a few at once, depending on how many cores you have. This should give an almost linear speedup.

On this two core, four thread laptop I see:

$ time parallel --jobs 10 -m ../sanity.py ::: *.jpg
k2_broken.jpg: unable to call avg
  VipsJpeg: Premature end of JPEG file
VipsJpeg: out of order read at line 48
real    0m10.828s

Down to 11s for 1,001 images now.
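
If you would rather keep everything in one script instead of using GNU parallel, the same idea can be sketched with a thread pool from the Python standard library. This is only a sketch: `find_broken` and `check_with_pyvips` are names I made up for illustration, and it assumes the decoding happens inside libvips with the Python GIL released, so that threads can overlap. If that does not hold on your setup, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor


def check_with_pyvips(filename):
    """Return None if the whole image decodes, otherwise the error text."""
    # Imported lazily so find_broken() can be used with another checker
    # on machines where pyvips is not installed.
    import pyvips

    try:
        # Same test as in the script above: stream the file and force
        # every pixel to be read by computing the average.
        image = pyvips.Image.new_from_file(filename,
                                           fail=True, access="sequential")
        image.avg()
    except pyvips.Error as e:
        return str(e)
    return None


def find_broken(filenames, check=check_with_pyvips, jobs=4):
    """Run check over filenames in a thread pool.

    Returns a list of (filename, error) pairs for the files that failed,
    in the same order as the input.
    """
    filenames = list(filenames)
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        errors = list(pool.map(check, filenames))
    return [(name, err) for name, err in zip(filenames, errors)
            if err is not None]
```

Calling `find_broken(sys.argv[1:], jobs=4)` would then give you the list of failing files and their error messages. The right value for `jobs` depends on your core count, so treat 4 as a placeholder.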
