Reputation: 63
First things first, I've already read the following related questions, and some more links from the first one, but none of them worked...
My problem is with opening huge (>80 MB each) and numerous (~3000) FITS files in a Jupyter Notebook. The relevant code snippet is the following:
# Dictionary to store NxN data matrices of cropped image tiles
CroppedObjects = {}

# Defining some other, here used variables....
# ...

# Iterate over all images ('j'), which contain the current object, indexed by 'i'
for i in range(0, len(finalObjects)):
    for j in range(0, len(containingImages[containedObj[i]])):
        countImages += 1

        # Path to the current image: 'mnt/...'
        current_image_path = ImagePaths[int(containingImages[containedObj[i]][j])]

        # Open .fits images
        with fits.open(current_image_path, memmap=False) as hdul:
            # Collect image data
            image_data = fits.getdata(current_image_path)

            # Collect WCS data from the current .fits's header
            ImageWCS = wcs.WCS(hdul[1].header)

            # Cropping parameters:
            # 1. Sky-coordinates of the croppable object
            # 2. Size of the crop, already defined above
            Coordinates = coordinates.SkyCoord(finalObjects[i][1]*u.deg, finalObjects[i][2]*u.deg, frame='fk5')
            size = (cropSize*u.pixel, cropSize*u.pixel)

            try:
                # Cut out the image tile
                cutout = Cutout2D(image_data, position=Coordinates, size=size, wcs=ImageWCS, mode='strict')

                # Write the cutout to a new FITS file
                cutout_filename = "Cropped_Images_Sorted/Cropped_" + str(containedObj[i]) + current_image_path[-23:]

                # Save data to dictionary
                CroppedObjects[cutout_filename] = cutout.data
                foundImages += 1

            except:
                pass

            else:
                del image_data
                continue

        # Memory maintenance
        gc.collect()

        # Progress bar
        sys.stdout.write("\rProgress: [{0}{1}] {2:.3f}%\tElapsed: {3}\tRemaining: {4} {5}".format(u'\u2588' * int(countImages/allCrops * progressbar_width),
                                                                                                  u'\u2591' * (progressbar_width - int(countImages/allCrops * progressbar_width)),
                                                                                                  countImages/allCrops * 100,
                                                                                                  datetime.now()-starttime,
                                                                                                  (datetime.now()-starttime)/countImages * (allCrops - countImages),
                                                                                                  foundImages))
        sys.stdout.flush()
So ok, it does actually three things:

1. Opens a FITS file and collects its image data and WCS information
2. Cuts out a tile around the given sky-coordinates (the mode is set to strict, so if the arrays only overlap partly, the try statement jumps to the next step in the loop)
3. Saves the cutout's data into a dictionary

Then it goes to the next file, does the same things and iterates over all of my FITS files.
BUT: If I try running this code, after approximately 1000 found pictures it stops and gives an OSError: [Errno 24] Too many open files on the line:

image_data = fits.getdata(current_image_path)

I tried everything that was supposed to solve the problem, but nothing helped... Not even setting memory mapping to False, or using fits.getdata and gc.collect()... I also tried many minor changes, like running without the try statement and cutting out all of the image tiles without any limitations. The del inside the else statement is also another miserable attempt by me.

What else can I try to make this finally work?

Also, feel free to ask me if something's not clear! I'll also try to help you to understand the problem!
Upvotes: 1
Views: 810
Reputation: 23306
This line is what's hurting you:
image_data = fits.getdata(current_image_path)
You've just opened that file on the previous line with memmap=False, but with that line you're re-opening it with memmap=True and holding the file open when you keep a reference to image_data, by way of wrapping it in a Cutout2D and then keeping a reference to the data with:

CroppedObjects[cutout_filename] = cutout.data

As far as I know, Cutout2D doesn't necessarily make a copy of the data if it doesn't have to, so you're still effectively holding open a reference to image_data, which is mmap'd.
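You can see this view behaviour concretely with a quick check along these lines (a small sketch with made-up arrays, not code from your script):

import numpy as np
from astropy.nddata import Cutout2D

data = np.arange(100.0).reshape(10, 10)              # stand-in for a full image array
cut = Cutout2D(data, position=(5, 5), size=(4, 4))   # copy=False by default
print(np.shares_memory(cut.data, data))              # True: the cutout still references 'data'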
Solution: Don't use fits.getdata here. See the warning about this in the docs:

These functions are useful for interactive Python sessions and simple analysis scripts, but should not be used for application code, as they are highly inefficient. For example, each call to getval() requires re-parsing the entire FITS file. Code that makes repeated use of these functions should instead open the file with open() and access the data structures directly.
So in your case you want to replace the line:
image_data = fits.getdata(current_image_path)
with
image_data = hdul[1].data
As @Christoph wrote in his answer, get rid of all the del image_data and gc.collect() stuff since it's not helping you anyways.
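Putting it together, the loop body could look roughly like this (a sketch only: the path, coordinates and crop size are placeholders for the variables in your code, and the image is assumed to live in extension 1 as in your snippet):

from astropy import units as u, wcs
from astropy.coordinates import SkyCoord
from astropy.io import fits
from astropy.nddata import Cutout2D

image_path = "example.fits"                                    # placeholder for current_image_path
position = SkyCoord(150.0 * u.deg, 2.0 * u.deg, frame="fk5")   # placeholder sky-coordinates
size = (50 * u.pixel, 50 * u.pixel)                            # placeholder crop size

with fits.open(image_path, memmap=False) as hdul:
    image_data = hdul[1].data          # read from the already-open HDUList,
                                       # no second open via fits.getdata()
    image_wcs = wcs.WCS(hdul[1].header)
    cutout = Cutout2D(image_data, position=position, size=size,
                      wcs=image_wcs, mode="strict")
    tile = cutout.data                 # still a view by default (copy=False)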
Addendum: From the API docs for Cutout2D, describing the copy argument:

If False (default), then the cutout data will be a view into the original data array. If True, then the cutout data will hold a copy of the original data array.
So this is stating explicitly (and I confirmed this with a peek at the code) that the Cutout2D is just taking a view of the original data array, meaning it's holding on to a reference to it. You could probably avoid this, if you want, by calling Cutout2D(..., copy=True). If you did that, you could probably do away with the memmap=False as well. Whether mmap is useful depends in part on the sizes of the images and how much physical RAM you have available. In your case, since you are not using the entire images and are just taking cutouts of them, memmap=True might even be more efficient, as it may allow avoiding paging the entire image array into memory. But this also depends on a lot of things, so you might want to do some performance testing of fits.open(..., memmap=False) + Cutout2D(..., copy=False) versus fits.open(..., memmap=True) + Cutout2D(..., copy=True), perhaps with a smaller number of files.
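A rough sketch of such a comparison might look like this (the glob pattern, coordinates and cutout size are placeholders, and the image data is again assumed to be in extension 1):

import glob
import time

from astropy import units as u, wcs
from astropy.coordinates import SkyCoord
from astropy.io import fits
from astropy.nddata import Cutout2D

def crop_all(paths, position, size, memmap, copy):
    """Open each file, take a cutout, and keep only the cutout data."""
    tiles = []
    for path in paths:
        with fits.open(path, memmap=memmap) as hdul:
            image_wcs = wcs.WCS(hdul[1].header)
            cut = Cutout2D(hdul[1].data, position=position, size=size,
                           wcs=image_wcs, copy=copy)
            tiles.append(cut.data)
    return tiles

# Placeholder inputs: adjust the pattern, coordinates and size to your data.
sample_paths = sorted(glob.glob("mnt/images/*.fits"))[:100]
position = SkyCoord(150.0 * u.deg, 2.0 * u.deg, frame="fk5")
size = (50 * u.pixel, 50 * u.pixel)

for memmap, copy in [(False, False), (True, True)]:
    start = time.perf_counter()
    crop_all(sample_paths, position, size, memmap=memmap, copy=copy)
    print(f"memmap={memmap}, copy={copy}: {time.perf_counter() - start:.2f} s")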
Upvotes: 3
Reputation: 2940
I've had a similar issue in the past (see here). In the end I made it work roughly like this:
total = 0
for filename in filenames:
    with fits.open(filename, memmap=False) as hdulist:
        data = hdulist['spam'].data
        total += data.sum()
Some notes:

- Use fits.open to open the file, with memmap=False
- Don't use fits.getdata, which I think opens the file again
- del and gc.collect shouldn't be needed; if the code is as simple as suggested here, there will not be circular references and Python will reliably delete objects at the end of scopes

Now it's possible that this will not help and you'll still have the issue. In that case, the way to proceed is to make a minimal reproducible example that doesn't work for you and that the Astropy devs can run (like I did here), and then to file an issue with Astropy, giving your Python version, Astropy version and operating system, or to post it here. The point is: this is complex and likely runtime / version dependent, so to pin it down you need an example that anyone can run and that fails for you.
Upvotes: 3