Reputation: 579
I am trying to run a parallel process in Python in which I extract certain polygons from a large array based on some conditions. The array holds more than 10k indexed polygons.
I pass (array, index) to an extract_polygon function. Depending on the conditions defined, the function either returns the polygon corresponding to that index or returns None. The array is never modified; it is only read to look up the polygon for the given index.
Since the array is very large, I run into an out-of-memory error during parallel processing. How can I avoid that? (In other words, how can I use a shared array effectively in multiprocessing?)
Below is my sample code:
import time
import multiprocessing as mp

import numpy as np
from scipy import ndimage

def extract_polygon(array, index):
    try:
        # Bounding-box slices for the region whose label equals index.
        islays = ndimage.find_objects(array == index)
        poly = array[islays[0][0], islays[0][1]]
        area = np.count_nonzero(poly)
        minArea = 100
        maxArea = 10000
        if (area > minArea) and (area < maxArea):
            return poly
        else:
            return None
    except Exception:
        return None

# array is the large polygon array; indices is a list of all the indexes we have.
start = time.time()
pool = mp.Pool(10)
results = pool.starmap(extract_polygon, [(array, index) for index in indices])
pool.close()
pool.join()
Can I use another library, such as Ray, in this case?
Upvotes: 1
Views: 2410
Reputation: 3362
You can absolutely use a library like Ray.
The structure would look something like this (simplified to remove your application logic).
import numpy as np
import ray

ray.init()

# Create the array and store it in shared memory once.
array = np.ones(10**6)
array_id = ray.put(array)

@ray.remote
def extract_polygon(array, index):
    # Change this to actually extract the polygon.
    return index

# Start 10 tasks that each take in the ID of the array in shared memory.
# These tasks execute in parallel (assuming there are enough CPU resources).
result_ids = [extract_polygon.remote(array_id, i) for i in range(10)]

# Fetch the results.
results = ray.get(result_ids)
You can read more about Ray in the documentation.
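If you would rather stay with the standard library, something similar can be sketched with multiprocessing.shared_memory (Python 3.8+). A likely cause of the memory blow-up in the question is that starmap pickles the whole array into every task's arguments; the sketch below instead copies the array into one shared block and lets each worker attach to it by name. This is a rough illustration, not a drop-in replacement: it assumes the array is a 2-D integer label image, and polygon_mask, init_worker, worker and run are hypothetical helper names. It also returns a cropped boolean mask of the label rather than the raw crop used in the question.

```python
import numpy as np
from multiprocessing import Pool, shared_memory

MIN_AREA, MAX_AREA = 100, 10000  # thresholds taken from the question

_labels = None  # per-worker view onto the shared block, set by init_worker
_shm = None     # kept alive so the buffer behind _labels stays valid


def polygon_mask(labels, index):
    """Return the cropped boolean mask of label `index`, or None if absent or out of range."""
    rows, cols = np.nonzero(labels == index)
    area = rows.size
    if not (MIN_AREA < area < MAX_AREA):
        return None
    window = labels[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    return window == index  # a new small array, independent of the shared block


def init_worker(shm_name, shape, dtype):
    """Runs once per worker: attach to the existing block instead of copying the data."""
    global _labels, _shm
    _shm = shared_memory.SharedMemory(name=shm_name)
    _labels = np.ndarray(shape, dtype=dtype, buffer=_shm.buf)


def worker(index):
    # Only the index travels through the task queue, not the array.
    return polygon_mask(_labels, index)


def run(labels, indices, processes=2):
    shm = shared_memory.SharedMemory(create=True, size=labels.nbytes)
    shared = np.ndarray(labels.shape, dtype=labels.dtype, buffer=shm.buf)
    shared[:] = labels  # the one copy into shared memory
    try:
        with Pool(processes, initializer=init_worker,
                  initargs=(shm.name, labels.shape, labels.dtype)) as pool:
            return pool.map(worker, indices)
    finally:
        del shared    # drop the view before closing the buffer
        shm.close()
        shm.unlink()  # free the block once all work is done


if __name__ == "__main__":
    labels = np.zeros((200, 200), dtype=np.int32)
    labels[10:30, 10:30] = 1  # 400 pixels: inside the area range
    labels[50:52, 50:52] = 2  # 4 pixels: too small
    results = run(labels, [1, 2, 3])
    print([None if r is None else r.shape for r in results])
```

Each task still builds a temporary full-size comparison mask (labels == index) inside the worker, so peak memory per worker is not zero, but the large array itself exists only once.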
Upvotes: 3