Louis Luk

Reputation: 303

Read data, update then write back to DB by Spark

I'm working on data processing with Spark and Cassandra.

I want to load the data from Cassandra first, process it, and then write it back to Cassandra.

When Spark runs the map function, an error occurs: Row is read-only <class 'Exception'>

Here is my method, shown below:

def detect_image(image_attribute):
    image_id = image_attribute['image_id']
    image_url = image_attribute['image_url']

    if image_attribute['status'] is None:
        image_attribute['status'] = Status()
    image_attribute['status']['detect_count'] += 1

    ... # the other item assignment

cassandra_data = sql_context.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="photo", keyspace="data") \
    .load()

cassandra_data_processed = cassandra_data.rdd.map(process_batch_image)

cassandra_data_processed.toDF().write \
        .format("org.apache.spark.sql.cassandra") \
        .mode('overwrite') \
        .options(table="photo", keyspace="data") \
        .save()

The Row is read-only <class 'Exception'> error is raised at the lines image_attribute['status'] = Status() and image_attribute['status']['detect_count'] += 1.

Is it necessary to copy image_attribute into a new object? image_attribute is a nested object, so copying it layer by layer would be very tedious.

Upvotes: 1

Views: 505

Answers (1)

TobiSH

Reputation: 2921

Your suggestion is absolutely right. The map function converts an incoming value into another value; that is at least the intention. The incoming object is immutable so that the operation stays idempotent. I guess there is no way around copying the image objects (manually or with something like deepcopy).
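
A minimal sketch of that copy step, assuming the per-row function is the detect_image from the question and that status is a struct column containing a detect_count field; the {'detect_count': 0} default below is a hypothetical stand-in for your Status() class. The idea is to copy the immutable Row into a plain dict with asDict(recursive=True), mutate the dict, and return it:

def detect_image(row):
    # Spark Row objects are immutable; copy the row into a plain
    # nested dict first. recursive=True also converts nested Rows
    # (such as 'status') into dicts.
    image = row.asDict(recursive=True)

    if image['status'] is None:
        image['status'] = {'detect_count': 0}  # hypothetical stand-in for Status()
    image['status']['detect_count'] += 1

    # ... the other item assignments, now on a mutable copy ...

    return image

cassandra_data_processed = cassandra_data.rdd.map(detect_image)

# Reuse the original schema so the nested dicts are mapped back to
# the same struct columns instead of being re-inferred.
processed_df = sql_context.createDataFrame(cassandra_data_processed,
                                           cassandra_data.schema)

processed_df.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode('overwrite') \
    .options(table="photo", keyspace="data") \
    .save()

Passing the original schema to createDataFrame sidesteps schema inference, which would otherwise map the nested dicts to a MapType rather than the struct columns Cassandra expects.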

Hope that helps

Upvotes: 1
