Mohan

Reputation: 473

How to do map of map processing in Spark

I have a csv as shown below,

T1,Data1,1278
T1,Data1,1279
T1,Data1,1280
T1,Data2,1283
T1,Data2,1284
T2,Data1,1278
T2,Data1,1290

I want to create a JavaPairRDD as a map of maps, like below:

T1,[(Data1, (1278,1279,1280)), (Data2, (1283,1284))]
T2,[(Data1, (1278,1290))]

I tried to use combineByKey to create a JavaPairRDD using the code below:

JavaPairRDD<Timestamp, List<Tuple2<String, List<Integer>>>> itemRDD = myrdd
    .mapToPair(new PairFunction<Row, Timestamp, Tuple2<String, Integer>>() {
        @Override
        public Tuple2<Timestamp, Tuple2<String, Integer>> call(Row row) throws Exception {
            return new Tuple2<>(row.getTimestamp(0),
                    new Tuple2<>(row.getString(1), row.getInt(2)));
        }
    })
    .combineByKey(createAcc, addItem, combine);

But I am not able to create a PairRDD like the above. Is my approach correct? Can combineByKey be used to create a map of maps in Spark?
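For context, the three functions combineByKey expects (createCombiner, mergeValue, mergeCombiners) can be modeled with plain Java collections. Below is a minimal local sketch of the merge logic only (no Spark); the method names createAcc, addItem, and combine mirror the ones referenced in the question, and the input is assumed to already be reduced to (category, value) pairs per key:

```java
import java.util.*;

public class CombineSketch {
    // createCombiner: start a per-key accumulator from the first (category, value) pair
    static Map<String, List<Integer>> createAcc(String category, int value) {
        Map<String, List<Integer>> acc = new HashMap<>();
        acc.computeIfAbsent(category, k -> new ArrayList<>()).add(value);
        return acc;
    }

    // mergeValue: fold one more (category, value) pair into an existing accumulator
    static Map<String, List<Integer>> addItem(Map<String, List<Integer>> acc, String category, int value) {
        acc.computeIfAbsent(category, k -> new ArrayList<>()).add(value);
        return acc;
    }

    // mergeCombiners: merge two partial accumulators (as Spark does across partitions)
    static Map<String, List<Integer>> combine(Map<String, List<Integer>> a, Map<String, List<Integer>> b) {
        for (Map.Entry<String, List<Integer>> e : b.entrySet()) {
            a.merge(e.getKey(), e.getValue(), (x, y) -> { x.addAll(y); return x; });
        }
        return a;
    }

    public static void main(String[] args) {
        // Rows for key T1 from the sample CSV
        Map<String, List<Integer>> acc = createAcc("Data1", 1278);
        acc = addItem(acc, "Data1", 1279);
        acc = addItem(acc, "Data1", 1280);
        Map<String, List<Integer>> other = createAcc("Data2", 1283);
        other = addItem(other, "Data2", 1284);
        Map<String, List<Integer>> merged = combine(acc, other);
        System.out.println(merged);
        // contains Data1=[1278, 1279, 1280] and Data2=[1283, 1284]
    }
}
```

With functions of this shape, combineByKey would produce one such accumulator per key (T1, T2), which is the map-of-maps structure the question asks for.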

Upvotes: 0

Views: 41

Answers (1)

Yehor Krivokon

Reputation: 877

Try to use the cogroup method instead of combineByKey.
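For comparison, the desired map-of-maps shape can be modeled locally with plain Java streams; this is a sketch of the target structure only, not Spark code:

```java
import java.util.*;
import java.util.stream.Collectors;

public class GroupSketch {
    // Build the nested key -> (category -> values) map from (key, category, value) triples
    static Map<String, Map<String, List<Integer>>> group(String[][] rows) {
        return Arrays.stream(rows)
            .collect(Collectors.groupingBy(r -> r[0],
                     Collectors.groupingBy(r -> r[1],
                     Collectors.mapping(r -> Integer.parseInt(r[2]), Collectors.toList()))));
    }

    public static void main(String[] args) {
        // The sample CSV rows from the question
        String[][] rows = {
            {"T1", "Data1", "1278"}, {"T1", "Data1", "1279"}, {"T1", "Data1", "1280"},
            {"T1", "Data2", "1283"}, {"T1", "Data2", "1284"},
            {"T2", "Data1", "1278"}, {"T2", "Data1", "1290"}
        };
        Map<String, Map<String, List<Integer>>> result = group(rows);
        System.out.println(result);
        // T1 -> {Data1=[1278, 1279, 1280], Data2=[1283, 1284]}, T2 -> {Data1=[1278, 1290]}
        // (map iteration order is unspecified)
    }
}
```

Whatever Spark operator is used, the per-key result should match what this local grouping produces.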

Upvotes: 1
