Best practice for key with two values

Question

So far I have a JavaDStream which first looked like this:

Value
---------------------
a,apple,spain
b,orange,italy
c,apple,italy
a,apple,italy
a,orange,greece

First I split the rows and mapped them to key-value pairs in a JavaPairDStream:

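import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Key each row by its first column; keep fruit and country together as the value.
// Assumes inputStream is a JavaDStream<String>.
JavaPairDStream<String, String> pairDStream = inputStream.mapToPair(row -> {
    String[] cols = row.split(",");
    String key = cols[0];
    String value = cols[1] + "," + cols[2];

    return new Tuple2<>(key, value);
});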

This gives me:

Key  | Value
---------------------
a    | apple,spain
b    | orange,italy
c    | apple,italy
a    | apple,italy
a    | orange,greece

In the end, the output should look like this:

Key  | Fruit | Country
-------------------------------
a    | 2     | 3
b    | 1     | 1
c    | 1     | 1

That is, it counts the number of unique fruits and unique countries for each key.

What is the best practice here? Should I first use groupByKey/reduceByKey and then split the value again? Or is it possible to have two values for each key in a key-value pair, like this?

Key  | Value1 | Value2
----------------------
a    | apple  | spain
b    | orange | italy
c    | apple  | italy
a    | apple  | italy
a    | orange | greece
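
To make the target computation concrete, here is a minimal sketch of the per-key logic in plain Java collections rather than the Spark API. The class `DistinctPerKey`, the holder `Seen`, and the method `countDistinct` are hypothetical names used only for illustration. It keeps two sets per key (one for fruits, one for countries) and reports their sizes; in Spark, the two values could presumably be carried as a nested `Tuple2<String, String>` value in the same way.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java sketch (no Spark): one key, two values, distinct counts per key.
class DistinctPerKey {

    // Hypothetical holder for the two per-key accumulators.
    static final class Seen {
        final Set<String> fruits = new HashSet<>();
        final Set<String> countries = new HashSet<>();
    }

    // Returns key -> {distinct fruit count, distinct country count}.
    static Map<String, int[]> countDistinct(List<String> rows) {
        Map<String, Seen> seenByKey = new TreeMap<>();
        for (String row : rows) {
            String[] cols = row.split(",");
            // Create the accumulator on first sight of a key, then add both values.
            Seen seen = seenByKey.computeIfAbsent(cols[0], k -> new Seen());
            seen.fruits.add(cols[1]);
            seen.countries.add(cols[2]);
        }
        Map<String, int[]> counts = new TreeMap<>();
        for (Map.Entry<String, Seen> e : seenByKey.entrySet()) {
            Seen s = e.getValue();
            counts.put(e.getKey(), new int[] {s.fruits.size(), s.countries.size()});
        }
        return counts;
    }

    public static void main(String[] args) {
        // The sample rows from the question.
        List<String> rows = Arrays.asList(
            "a,apple,spain", "b,orange,italy", "c,apple,italy",
            "a,apple,italy", "a,orange,greece");
        countDistinct(rows).forEach((key, c) ->
            System.out.println(key + " | " + c[0] + " | " + c[1]));
    }
}
```

Run on the sample rows, this gives 2 fruits and 3 countries for key a, and 1 and 1 for keys b and c, matching the desired output table. Using sets as the per-key accumulators means duplicates such as the second "apple" for key a are dropped without a separate de-duplication step.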
