Louis Luk

Reputation: 303

Read data, update then write back to DB by Spark

I'm working on data processing with Spark and Cassandra.

I want to load the data from Cassandra first, process it, and then write it back to Cassandra.

When Spark runs the map function, an error occurs: Row is read-only <class 'Exception'>

Here is my method, shown below:

def detect_image(image_attribute):
    image_id = image_attribute['image_id']
    image_url = image_attribute['image_url']

    if image_attribute['status'] is None:
        image_attribute['status'] = Status()
    image_attribute['status']['detect_count'] += 1

    ... # the other item assignment

cassandra_data = sql_context.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="photo", keyspace="data") \
    .load()

cassandra_data_processed = cassandra_data.rdd.map(process_batch_image)

cassandra_data_processed.toDF().write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('overwrite') \
        .options(table="photo", keyspace="data") \
        .save()

The Row is read-only <class 'Exception'> error is raised at the lines image_attribute['status'] = Status() and image_attribute['status']['detect_count'] += 1.

Is it necessary to copy image_attribute into a new object? image_attribute is a nested object, so copying it layer by layer would be very tedious.

Upvotes: 1

Views: 505

Answers (1)

TobiSH

Reputation: 2921

Your suggestion is absolutely right. The map function converts an incoming value into another value; that is at least the intention. The incoming object is immutable so that the operation stays idempotent. I guess there is no way around copying the image objects (manually or with something like deepcopy).
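
A minimal sketch of that copy step, assuming the per-row function is the detect_image from the question and that status is a struct column containing a detect_count field; the {'detect_count': 0} default below is a hypothetical stand-in for your Status() class. The idea is to copy the immutable Row into a plain dict with asDict(recursive=True), mutate the dict, and return it:

def detect_image(row):
    # Spark Row objects are immutable; copy the row into a plain
    # nested dict first. recursive=True also converts nested Rows
    # (such as 'status') into dicts.
    image = row.asDict(recursive=True)

    if image['status'] is None:
        image['status'] = {'detect_count': 0}  # hypothetical stand-in for Status()
    image['status']['detect_count'] += 1

    # ... the other item assignments, now on a mutable copy ...

    return image

cassandra_data_processed = cassandra_data.rdd.map(detect_image)

# Reuse the original schema so the nested dicts are mapped back to
# the same struct columns instead of being re-inferred.
processed_df = sql_context.createDataFrame(cassandra_data_processed,
                                           cassandra_data.schema)

processed_df.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode('overwrite') \
    .options(table="photo", keyspace="data") \
    .save()

Passing the original schema to createDataFrame sidesteps schema inference, which would otherwise map the nested dicts to a MapType rather than the struct columns Cassandra expects.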

Hope that helps

Upvotes: 1
