Reputation: 1265
I have a use case in which I have a set of data (e.g. a CSV file containing around 10 million rows and around 25 columns), and a set of rules (around 1000) that I need to use to update the records. These rules have to execute sequentially.
I wrote code that loops over every rule and, for each rule, updates the data.
Suppose a rule looks like:
if col1 = 5 and col2 = 10 then col25 = updatedValue
rulesList.foreach { rule =>
  data = data.map { line(col1, col2, .., col25) =>
    if (rule) line(col1, col2, .., updatedValue)
    else line(col1, col2, .., col25)
  }
}
These rules execute sequentially, and finally I get the updated records.
The problem is that when the rules and data are small it executes properly, but when the data is large I get a StackOverflowError. The reason may be that it builds up the map for every rule and only executes them all lazily at the end, like map-reduce.
Is there any way I can update this data incrementally?
Upvotes: 1
Views: 1398
Reputation: 17872
Given a record in the RDD, if you can apply all updates to it incrementally but independently of the other records, I would suggest you do the map first and then iterate through the rulesList inside the map:
val result = data.map { case line(col1, col2, ..., col25) =>
  var col25_mutable = col25
  rulesList.foreach { rule =>
    col25_mutable = if (rule) updatedValue else col25_mutable
  }
  line(col1, col2, ..., col25_mutable)
}
This approach should be thread-safe if rulesList is a simple iterable object, such as Array or List.
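For concreteness, here is a minimal self-contained sketch of this pattern. The Record and Rule case classes, the sample rule, and the sample data are stand-ins I made up for illustration, not from your post; a fold is used so that each rule sees the record as already updated by the previous rules:

import org.apache.spark.sql.SparkSession

// Hypothetical record type standing in for the 25-column line.
case class Record(col1: Int, col2: Int, col25: String)

// Hypothetical rule encoding: a predicate plus the replacement value.
case class Rule(matches: Record => Boolean, updatedValue: String)

object SingleMapUpdate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rule-updates").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Example rule list; in the real use case there would be ~1000 rules.
    val rulesList = List(
      Rule(r => r.col1 == 5 && r.col2 == 10, "updatedValue")
    )

    val data = sc.parallelize(Seq(Record(5, 10, "old"), Record(1, 2, "old")))

    // One map over the RDD; the rules are folded over each record in order.
    val result = data.map { record =>
      rulesList.foldLeft(record) { (current, rule) =>
        if (rule.matches(current)) current.copy(col25 = rule.updatedValue)
        else current
      }
    }

    result.collect().foreach(println)
    spark.stop()
  }
}

Because the whole rule loop runs inside a single map, the lineage grows by one stage instead of one stage per rule, which is what was blowing the stack.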
I hope it works for you, or that it at least helps you achieve your goal.
Cheers
Upvotes: 0
Reputation: 25909
Try mapping once over the RDD and looping over the rules inside the map: a lot less data movement. All the rules will be applied locally to the data, resulting in the updated record, instead of creating 1000 RDDs.
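One way to sketch this is a variant of the loop-inside-the-map idea: compose the rules into a single function up front and apply it in one map. The Record case class and the function-based rule encoding here are assumptions for illustration, not from the question:

import org.apache.spark.rdd.RDD

object ComposedRules {
  // Hypothetical record type for the 25-column line.
  case class Record(col1: Int, col2: Int, col25: String)

  // Each rule encoded as a plain Record => Record function.
  val rulesList: List[Record => Record] = List(
    r => if (r.col1 == 5 && r.col2 == 10) r.copy(col25 = "updatedValue") else r
    // ... roughly 1000 rules in practice, kept in execution order
  )

  // Fuse all rules into a single function; identity covers an empty rule list.
  val applyAllRules: Record => Record =
    rulesList.foldLeft((r: Record) => r)(_ andThen _)

  // One map and one new RDD, instead of one RDD per rule.
  def updateAll(data: RDD[Record]): RDD[Record] =
    data.map(applyAllRules)
}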
Upvotes: 2