masterdesky
masterdesky

Reputation: 63

OSError 24 when opening FITS with Astropy

First things first, I've already read the following:

And some more links from the first one, but none of them worked...

My problem is with opening huge (>80 Mb/pc.) and numerous (~3000) FITS files in Jupyter Notebook. The relevant code snippet is the following:

# Dictionary to store NxN data matrices of cropped image tiles
CroppedObjects = {}

# Defining some other, here used variable....
# ...

# Interate over all images ('j'), which contain the current object, indexed by 'i'
for i in range(0, len(finalObjects)):
    for j in range(0, len(containingImages[containedObj[i]])):

        countImages += 1

        # Path to the current image: 'mnt/...'
        current_image_path = ImagePaths[int(containingImages[containedObj[i]][j])]

        # Open .fits images
        with fits.open(current_image_path, memmap=False) as hdul:
            # Collect image data
            image_data = fits.getdata(current_image_path)

            # Collect WCS data from the current .fits's header
            ImageWCS = wcs.WCS(hdul[1].header)

            # Cropping parameters:
            # 1. Sky-coordinates of the croppable object
            # 2. Size of the crop, already defined above
            Coordinates = coordinates.SkyCoord(finalObjects[i][1]*u.deg,finalObjects[i][2]*u.deg, frame='fk5')
            size = (cropSize*u.pixel, cropSize*u.pixel)

            try:
                # Cut out the image tile
                cutout = Cutout2D(image_data, position=Coordinates, size=size, wcs=ImageWCS, mode='strict')

                # Write the cutout to a new FITS file
                cutout_filename = "Cropped_Images_Sorted/Cropped_" + str(containedObj[i]) + current_image_path[-23:]

                # Sava data to dictionary
                CroppedObjects[cutout_filename] = cutout.data

                foundImages += 1

            except:
                pass

            else:
                del image_data
                continue

        # Memory maintainance                
        gc.collect()

        # Progress bar
        sys.stdout.write("\rProgress: [{0}{1}] {2:.3f}%\tElapsed: {3}\tRemaining: {4}  {5}".format(u'\u2588' * int(countImages/allCrops * progressbar_width),
                                                                                                   u'\u2591' * (progressbar_width - int(countImages/allCrops * progressbar_width)),
                                                                                                   countImages/allCrops * 100,
                                                                                                   datetime.now()-starttime,
                                                                                                   (datetime.now()-starttime)/countImages * (allCrops - countImages),
                                                                                                   foundImages))

        sys.stdout.flush()

So ok, it does actually three things:

  1. Opens a particular FITS file
  2. Cuts a square out of it (but strictly, so if arrays only overlap partly, the try statement jumps to the next step in the loop)
  3. Updates the progress bar

Then goes to the next file, does the same things and iterates over all of my FITS files.

BUT: If I try running this code, after approximately 1000 found pictures, it stops and gives and OSError: [Errno 24] Too many open files on the line:

image_data = fits.getdata(current_image_path)

I tried everything, which was supposed to solve the problem, but nothing helped... Not even setting memory mapping to false or using fits.getdata and gc.collect()... Also tried many minor changes, like running without the try statement, cutting out all of the image tiles, without any limitations. The del inside the else statement is also another miserable attempt by me. What else can I try to make this finally work?
Also, feel free to ask me if somethings not clear! I'll also try to help you to understand the problem!

Upvotes: 1

Views: 810

Answers (2)

Iguananaut
Iguananaut

Reputation: 23306

This line is what's hurting you:

image_data = fits.getdata(current_image_path)

You've just opened that file on the previous line with memmap=False, but with that line you're re-opening it with memmap=True and holding the file open when you keep a reference to image_data by way of wrapping it in a Cutout2D and then keeping a reference to the data with:

CroppedObjects[cutout_filename] = cutout.data

As far as I know, Cutout2D doesn't necessarily make a copy of the data if it doesn't have to, so you're still effectively holding open a reference to image_data which is mmap'd.

Solution: Don't use fits.getdata here. See the warning about this in the docs:

These functions are useful for interactive Python sessions and simple analysis scripts, but should not be used for application code, as they are highly inefficient. For example, each call to getval() requires re-parsing the entire FITS file. Code that makes repeated use of these functions should instead open the file with open() and access the data structures directly.

So in your case you want to replace the line:

image_data = fits.getdata(current_image_path)

with

image_data = hdul[1].data

As @Christoph wrote in his answer, get rid of all the del image_data and gc.collect() stuff since it's not helping you anyways.

Addendum: From the API docs for Cutout2D:

If False (default), then the cutout data will be a view into the original data array. If True, then the cutout data will hold a copy of the original data array.

So this is stating explicitly (and I confirmed this with a peek at the code) that the Cutout2D is just taking a view of the original data array, meaning it's holding on to a reference to it. You could probably avoid this, if you want, by calling Cutout2D(..., copy=True). If you did that, you could probably do away with the memmap=False as well. Using mmap may or may not be useful: It depends in part on the sizes of the images and how much physical RAM you have available. In your case it might be faster since you are not using the entire images, and are just taking cutouts of them. This means that using memmap=True might be more efficient as it may allow avoiding paging the entire image array into memory.

But this could also also depends on a lot of things, so you might want to do some performance testing with fits.open(..., memmap=False)+Cutout2D(..., copy=False) versus fits.open(..., memmap=True)+Cutout2D(..., copy=True), perhaps with a smaller number of files.

Upvotes: 3

Christoph
Christoph

Reputation: 2940

I've had a similar issue in the past (see here). In the end I made it work roughly like this:

total = 0
for filename in filenames:
    with fits.open(filename, memmap=False) as hdulist:
        data = hdulist['spam'].data
    total += data.sum()

Some notes:

  • use fits.open to open the file, with memmap=False
  • use it in a with block, to make the file closing reliable
  • keep the with block short, just load the data you need into memory, then close the file by exiting it
  • do what you need to do with the data after the file is closed; that might not really be needed, but if Python references to data in the file are the issue that prevent it from being closed, this simplifies the situation. I don't think the cutout code is the problem in your example, but it could be - try uncommenting it?
  • don't do an extra fits.getdata which I think opens the file again
  • the del and gc.collect shouldn't be needed, if the code is simple as suggested here, there will not be circular references and Python will reliably delete objects at the end of scopes

Now it's possible that this will not help, you'll still have the issue. In that case, the way to proceed is to make a minimal reproducible example that doesn't work for you that Astropy devs can run (like I did here), and then to file an issue with Astropy, giving your Python version, Astropy version and operating system, or post it here. The point is: this is complex and likely runtime / version dependent, so to try to pin it down an example anyone can run, that fails for you, is needed.

Upvotes: 3

Related Questions