Reputation: 63
First things first, I've already read the following related questions, and some more links from the first one, but none of them worked...
My problem is with opening huge (>80 MB each) and numerous (~3000) FITS files in a Jupyter Notebook. The relevant code snippet is the following:
# Dictionary to store NxN data matrices of cropped image tiles
CroppedObjects = {}

# Defining some other, here used variables....
# ...

# Iterate over all images ('j'), which contain the current object, indexed by 'i'
for i in range(0, len(finalObjects)):
    for j in range(0, len(containingImages[containedObj[i]])):
        countImages += 1

        # Path to the current image: 'mnt/...'
        current_image_path = ImagePaths[int(containingImages[containedObj[i]][j])]

        # Open .fits images
        with fits.open(current_image_path, memmap=False) as hdul:
            # Collect image data
            image_data = fits.getdata(current_image_path)

            # Collect WCS data from the current .fits's header
            ImageWCS = wcs.WCS(hdul[1].header)

            # Cropping parameters:
            # 1. Sky-coordinates of the croppable object
            # 2. Size of the crop, already defined above
            Coordinates = coordinates.SkyCoord(finalObjects[i][1]*u.deg, finalObjects[i][2]*u.deg, frame='fk5')
            size = (cropSize*u.pixel, cropSize*u.pixel)

            try:
                # Cut out the image tile
                cutout = Cutout2D(image_data, position=Coordinates, size=size, wcs=ImageWCS, mode='strict')

                # Write the cutout to a new FITS file
                cutout_filename = "Cropped_Images_Sorted/Cropped_" + str(containedObj[i]) + current_image_path[-23:]

                # Save data to dictionary
                CroppedObjects[cutout_filename] = cutout.data
                foundImages += 1

            except:
                pass

            else:
                del image_data
                continue

        # Memory maintenance
        gc.collect()

        # Progress bar
        sys.stdout.write("\rProgress: [{0}{1}] {2:.3f}%\tElapsed: {3}\tRemaining: {4} {5}".format(u'\u2588' * int(countImages/allCrops * progressbar_width),
                                                                                                  u'\u2591' * (progressbar_width - int(countImages/allCrops * progressbar_width)),
                                                                                                  countImages/allCrops * 100,
                                                                                                  datetime.now()-starttime,
                                                                                                  (datetime.now()-starttime)/countImages * (allCrops - countImages),
                                                                                                  foundImages))
        sys.stdout.flush()
So ok, it does actually three things:

1. Opens a FITS file and collects its image data and WCS information
2. Cuts out a tile around the given sky-coordinates (the mode is set to strict, so if the arrays only overlap partly, the try statement jumps to the next step in the loop)
3. Saves the cutout's data into a dictionary

Then it goes to the next file, does the same things and iterates over all of my FITS files.
BUT: If I try running this code, after approximately 1000 found pictures it stops and gives an OSError: [Errno 24] Too many open files on the line:

image_data = fits.getdata(current_image_path)

I tried everything that was supposed to solve the problem, but nothing helped... Not even setting memory mapping to False, or using fits.getdata and gc.collect()... I also tried many minor changes, like running without the try statement and cutting out all of the image tiles without any limitations. The del inside the else statement is also another miserable attempt by me.

What else can I try to make this finally work?

Also, feel free to ask me if something's not clear! I'll also try to help you to understand the problem!
Upvotes: 1
Views: 810
Reputation: 23306
This line is what's hurting you:
image_data = fits.getdata(current_image_path)
You've just opened that file on the previous line with memmap=False, but with that line you're re-opening it with memmap=True and holding the file open when you keep a reference to image_data, by way of wrapping it in a Cutout2D and then keeping a reference to the data with:

CroppedObjects[cutout_filename] = cutout.data

As far as I know, Cutout2D doesn't necessarily make a copy of the data if it doesn't have to, so you're still effectively holding open a reference to image_data, which is mmap'd.
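You can see this view behaviour concretely with a quick check along these lines (a small sketch with made-up arrays, not code from your script):

import numpy as np
from astropy.nddata import Cutout2D

data = np.arange(100.0).reshape(10, 10)              # stand-in for a full image array
cut = Cutout2D(data, position=(5, 5), size=(4, 4))   # copy=False by default
print(np.shares_memory(cut.data, data))              # True: the cutout still references 'data'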
Solution: Don't use fits.getdata here. See the warning about this in the docs:

These functions are useful for interactive Python sessions and simple analysis scripts, but should not be used for application code, as they are highly inefficient. For example, each call to getval() requires re-parsing the entire FITS file. Code that makes repeated use of these functions should instead open the file with open() and access the data structures directly.
So in your case you want to replace the line:
image_data = fits.getdata(current_image_path)
with
image_data = hdul[1].data
As @Christoph wrote in his answer, get rid of all the del image_data and gc.collect() stuff since it's not helping you anyways.
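Putting it together, the loop body could look roughly like this (a sketch only: the path, coordinates and crop size are placeholders for the variables in your code, and the image is assumed to live in extension 1 as in your snippet):

from astropy import units as u, wcs
from astropy.coordinates import SkyCoord
from astropy.io import fits
from astropy.nddata import Cutout2D

image_path = "example.fits"                                    # placeholder for current_image_path
position = SkyCoord(150.0 * u.deg, 2.0 * u.deg, frame="fk5")   # placeholder sky-coordinates
size = (50 * u.pixel, 50 * u.pixel)                            # placeholder crop size

with fits.open(image_path, memmap=False) as hdul:
    image_data = hdul[1].data          # read from the already-open HDUList,
                                       # no second open via fits.getdata()
    image_wcs = wcs.WCS(hdul[1].header)
    cutout = Cutout2D(image_data, position=position, size=size,
                      wcs=image_wcs, mode="strict")
    tile = cutout.data                 # still a view by default (copy=False)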
Addendum: From the API docs for Cutout2D, describing the copy argument:

If False (default), then the cutout data will be a view into the original data array. If True, then the cutout data will hold a copy of the original data array.
So this is stating explicitly (and I confirmed this with a peek at the code) that the Cutout2D is just taking a view of the original data array, meaning it's holding on to a reference to it. You could probably avoid this, if you want, by calling Cutout2D(..., copy=True). If you did that, you could probably do away with the memmap=False as well. Whether mmap is useful depends in part on the sizes of the images and how much physical RAM you have available. In your case, since you are not using the entire images and are just taking cutouts of them, memmap=True might even be more efficient, as it may allow avoiding paging the entire image array into memory. But this also depends on a lot of things, so you might want to do some performance testing of fits.open(..., memmap=False) + Cutout2D(..., copy=False) versus fits.open(..., memmap=True) + Cutout2D(..., copy=True), perhaps with a smaller number of files.
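A rough sketch of such a comparison might look like this (the glob pattern, coordinates and cutout size are placeholders, and the image data is again assumed to be in extension 1):

import glob
import time

from astropy import units as u, wcs
from astropy.coordinates import SkyCoord
from astropy.io import fits
from astropy.nddata import Cutout2D

def crop_all(paths, position, size, memmap, copy):
    """Open each file, take a cutout, and keep only the cutout data."""
    tiles = []
    for path in paths:
        with fits.open(path, memmap=memmap) as hdul:
            image_wcs = wcs.WCS(hdul[1].header)
            cut = Cutout2D(hdul[1].data, position=position, size=size,
                           wcs=image_wcs, copy=copy)
            tiles.append(cut.data)
    return tiles

# Placeholder inputs: adjust the pattern, coordinates and size to your data.
sample_paths = sorted(glob.glob("mnt/images/*.fits"))[:100]
position = SkyCoord(150.0 * u.deg, 2.0 * u.deg, frame="fk5")
size = (50 * u.pixel, 50 * u.pixel)

for memmap, copy in [(False, False), (True, True)]:
    start = time.perf_counter()
    crop_all(sample_paths, position, size, memmap=memmap, copy=copy)
    print(f"memmap={memmap}, copy={copy}: {time.perf_counter() - start:.2f} s")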
Upvotes: 3
Reputation: 2940
I've had a similar issue in the past (see here). In the end I made it work roughly like this:
total = 0
for filename in filenames:
    with fits.open(filename, memmap=False) as hdulist:
        data = hdulist['spam'].data
        total += data.sum()
Some notes:

- Use fits.open to open the file, with memmap=False
- Don't use fits.getdata, which I think opens the file again
- del and gc.collect shouldn't be needed; if the code is as simple as suggested here, there will not be circular references and Python will reliably delete objects at the end of scopes

Now it's possible that this will not help and you'll still have the issue. In that case, the way to proceed is to make a minimal reproducible example that doesn't work for you and that the Astropy devs can run (like I did here), and then to file an issue with Astropy, giving your Python version, Astropy version and operating system, or to post it here. The point is: this is complex and likely runtime / version dependent, so to pin it down you need an example that anyone can run and that fails for you.
Upvotes: 3