Save image from RDD Apache Spark system

Question

I want to save to disk the images held in my RDD after I map them.

I created a simple Spark session in my main.py file, which calls the function preprocess_spark. That function returns an RDD of tuples named samples, each in the format (slide_num, image). The image is an np.array that is converted to an image file in the save_jpeg_help function.

When I open the Apache Spark web UI, I see a job corresponding to the line:

rdd.foreach(lambda sample_element: save_nonlabelled_sample_2_jpeg(sample_element, save_dir))

but when it finishes, nothing has been saved in my save_dir directory.

Any idea what I'm doing wrong?

Kind regards

main.py

import os

import numpy as np
from PIL import Image
from pyspark.sql import SparkSession

spark = (SparkSession.builder
     .appName("Oncofinder -- Preprocessing")
     .getOrCreate())

samples = preprocess_spark(spark, [1])

save_jpegs = True
if save_jpegs:
    save_rdd_2_jpeg(samples, './data/images')


def save_rdd_2_jpeg(rdd, save_dir):
    rdd.foreach(lambda sample_element: save_nonlabelled_sample_2_jpeg(sample_element, save_dir))


def save_nonlabelled_sample_2_jpeg(sample, save_dir):
    slide_num, img_value = sample
    # Random suffix to tell samples of one slide apart; collisions are possible.
    filename = '{slide_num}_{hash}.jpeg'.format(
        slide_num=slide_num, hash=np.random.randint(10000))
    filepath = os.path.join(save_dir, filename)
    save_jpeg_help(img_value, filepath)

def save_jpeg_help(img_value, filepath):
    # Create the target directory if needed, then write the array as an RGB image.
    dirpath = os.path.dirname(filepath)
    os.makedirs(dirpath, exist_ok=True)
    img = Image.fromarray(img_value.astype(np.uint8), 'RGB')
    img.save(filepath)


def preprocess_spark(spark, slide_nums, folder="data", training=False, tile_size=1024, overlap=0,
               tissue_threshold=0.9, sample_size=256, grayscale=False, normalize_stains=True,
               num_partitions=20000):

    slides = (spark.sparkContext
              .parallelize(slide_nums)
              .filter(lambda slide: open_slide(slide, folder, training) is not None))
    tile_indices = (slides.flatMap(
        lambda slide: process_slide(slide, folder, training, tile_size, overlap)))
    tile_indices = tile_indices.repartition(num_partitions)
    tile_indices.cache()

    tiles = tile_indices.map(lambda tile_index: process_tile_index(tile_index, folder, training))
    filtered_tiles = tiles.filter(lambda tile: keep_tile(tile, tile_size, tissue_threshold))
    samples = filtered_tiles.flatMap(lambda tile: process_tile(tile, sample_size, grayscale))
    if normalize_stains:
        samples = samples.map(lambda sample: normalize_staining(sample))

    return samples

EDIT: I'm using

PYSPARK_PYTHON=python3 spark-submit --master spark://127.0.1.1:7077 spark_preprocessing.py

to run the application. It seems that after the foreach action, nothing else happens. Is there any reason for that?

tel · Accepted Answer

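rdd.foreach runs your function on the executors, not on the driver. Each file is therefore written by whichever worker processes that sample, and a relative path such as './data/images' is resolved against the executor's working directory rather than the directory you launched spark-submit from. That is most likely why the job completes but nothing shows up where you are looking. You can fix the issue if you collect all of your samples onto the driver node before you try to save them. If you redefine save_rdd_2_jpeg as: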

def save_rdd_2_jpeg(rdd, save_dir):
    for sample in rdd.collect():
        save_nonlabelled_sample_2_jpeg(sample, save_dir)

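then the images should be written to save_dir on the driver. Keep in mind that collect() brings every sample into driver memory at once, so this only works if the whole data set fits there.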
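
If driver memory is a concern, one variation is to stream the samples to the driver one partition at a time with rdd.toLocalIterator() instead of collect(). The following is only a minimal sketch and has not been run against the pipeline in the question. The names save_samples_on_driver and write_fn are hypothetical, and the index-based filename is an assumption that stands in for the random suffix so that two samples cannot overwrite each other.

```python
import os


def save_samples_on_driver(samples, save_dir, write_fn):
    """Write (slide_num, payload) samples to save_dir on the driver.

    samples  -- any iterable, e.g. rdd.toLocalIterator() or a plain list
    write_fn -- callable(payload, filepath), e.g. save_jpeg_help from the question
    """
    # Resolve the directory once so the result does not depend on a later cwd change.
    save_dir = os.path.abspath(save_dir)
    os.makedirs(save_dir, exist_ok=True)

    written = []
    for index, (slide_num, payload) in enumerate(samples):
        # Assumption: a running index replaces the random suffix to keep names unique.
        filepath = os.path.join(save_dir, "{}_{}.jpeg".format(slide_num, index))
        write_fn(payload, filepath)
        written.append(filepath)
    return written
```

With the code from the question, this could be called as save_samples_on_driver(samples.toLocalIterator(), './data/images', save_jpeg_help). Because the function accepts any iterable, it can also be tried on a plain list before running it on the cluster.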
