Reputation: 1265
I have a use case in which I have a set of data (e.g. a CSV file containing around 10 million rows and around 25 columns), and a set of rules (around 1000) that I need to use to update the records. These rules have to execute sequentially.
I wrote code that loops over every rule and, for each rule, updates the data.
Suppose a rule looks like:
if col1 = 5 and col2 = 10 then col25 = updatedValue
rulesList.foreach { rule =>
  data = data.map { line(col1, col2, .., col25) =>
    if (rule) line(col1, col2, .., updatedValue)
    else line(col1, col2, .., col25)
  }
}
These rules execute sequentially, and finally I get the updated records.
The problem is that when the rules and data are small it executes properly, but when the data is large I get a StackOverflowError. The reason may be that it builds up the map for every rule and only executes them all lazily at the end, like map-reduce.
Is there any way I can update this data incrementally?
Upvotes: 1
Views: 1398
Reputation: 17872
Given a record in the RDD, if you can apply all updates to it incrementally but independently of the other records, I would suggest you do the map first and then iterate through the rulesList inside the map:
val result = data.map { case line(col1, col2, ..., col25) =>
  var col25_mutable = col25
  rulesList.foreach { rule =>
    col25_mutable = if (rule) updatedValue else col25_mutable
  }
  line(col1, col2, ..., col25_mutable)
}
This approach should be thread-safe if rulesList is a simple iterable object, such as Array or List.
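For concreteness, here is a minimal self-contained sketch of this pattern. The Record and Rule case classes, the sample rule, and the sample data are stand-ins I made up for illustration, not from your post; a fold is used so that each rule sees the record as already updated by the previous rules:

import org.apache.spark.sql.SparkSession

// Hypothetical record type standing in for the 25-column line.
case class Record(col1: Int, col2: Int, col25: String)

// Hypothetical rule encoding: a predicate plus the replacement value.
case class Rule(matches: Record => Boolean, updatedValue: String)

object SingleMapUpdate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rule-updates").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Example rule list; in the real use case there would be ~1000 rules.
    val rulesList = List(
      Rule(r => r.col1 == 5 && r.col2 == 10, "updatedValue")
    )

    val data = sc.parallelize(Seq(Record(5, 10, "old"), Record(1, 2, "old")))

    // One map over the RDD; the rules are folded over each record in order.
    val result = data.map { record =>
      rulesList.foldLeft(record) { (current, rule) =>
        if (rule.matches(current)) current.copy(col25 = rule.updatedValue)
        else current
      }
    }

    result.collect().foreach(println)
    spark.stop()
  }
}

Because the whole rule loop runs inside a single map, the lineage grows by one stage instead of one stage per rule, which is what was blowing the stack.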
I hope it works for you, or that it at least helps you achieve your goal.
Cheers
Upvotes: 0
Reputation: 25909
Try mapping once over the RDD and looping over the rules inside the map: a lot less data movement. All the rules will be applied locally to the data, resulting in the updated record, instead of creating 1000 RDDs.
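One way to sketch this is a variant of the loop-inside-the-map idea: compose the rules into a single function up front and apply it in one map. The Record case class and the function-based rule encoding here are assumptions for illustration, not from the question:

import org.apache.spark.rdd.RDD

object ComposedRules {
  // Hypothetical record type for the 25-column line.
  case class Record(col1: Int, col2: Int, col25: String)

  // Each rule encoded as a plain Record => Record function.
  val rulesList: List[Record => Record] = List(
    r => if (r.col1 == 5 && r.col2 == 10) r.copy(col25 = "updatedValue") else r
    // ... roughly 1000 rules in practice, kept in execution order
  )

  // Fuse all rules into a single function; identity covers an empty rule list.
  val applyAllRules: Record => Record =
    rulesList.foldLeft((r: Record) => r)(_ andThen _)

  // One map and one new RDD, instead of one RDD per rule.
  def updateAll(data: RDD[Record]): RDD[Record] =
    data.map(applyAllRules)
}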
Upvotes: 2