how to improve performance by avoiding flatmap operation in apache spark

Question

I am runnign a set of rules against the my java itemObjects. For each item, I am processing the list of the rules.

Normally I have 1 million items and 100 rules.

Currently running this program in spark is taking 15 mins.

I observed that faltMaptopair is taking more time. I want to improve the performance of this program.

Get the rules
map each item against the list of rules and produce result set
return JavaPairRDD of itemId and List

Any suggestions of refactor this code to improve performance further

I have written the following code.

public JavaPairRDD> validate() {       
        List> rules = ruleWrapper.getRulesList().collect();
        JavaPairRDD> resultsPairRDD = itemsForValidation
                .map(x -> getRulesResult(rules, x))
                .flatMapToPair(this::mapToRuleResultById)
                .aggregateByKey(
                        MapperUtil.newList(),
                        MapperUtil::addToList,
                        MapperUtil::combineLists
                );      
        return resultsPairRDD;
    }

    private List> mapToRuleResultById(List ruleResults) {
        return ruleResults.stream()
                .map(ruleResult -> new Tuple2<>(ruleResult.getItemId(), ruleResult))
                .collect(toList());
    }

    private List getRulesResult(List> rules, T x) {
        return rules.stream()
                .map(rule -> rule.execute(x)).collect(toList());
    }

    public  RuleResult execute(T t){
    //get the rule result
    }

    public class RuleResult{
        private String itemId;
    }

how to improve performance by avoiding flatmap operation in apache spark

Answers (1)

Related Questions