Reputation: 143
I have an RDD of custom objects, let's say Person
. then I use several narrow (although could be wide) transformations on this RDD, each time I get a new RDD. finally I get an RDD with a different type, let's say an Integer
.
Now I want to know in some way what Integer
linked to each Person
, and to print it like this:
person a -> 3
person b -> 1
person c -> 7
I tried: JavaPairRDD resultRDD = myRDD.mapToPair(rec -> new Tuple2(rec, new SomeFunction.call(rec)));
this code works for me because I can get each tuple and print it. but I'm not sure if it is a good way to implement this when there are many transformations (is it?)
thought to use another option: transformedRDD.parent(number,evidence)
and in that way get the original RDD and then some how to identify the reference between the Person and the Integer.
Note: evidence
is scala.reflect.ClassTag<U>
and I am not familiar with scala so I don't really understand what to write there
Any help will be appreciated!
Upvotes: 0
Views: 175
Reputation: 143
After some experiments I decided to use the following solution:
JavaRDD<Person> persons = sc.parallelize(personList);
JavaRDD<Person,SomeType> trans1 = persons.mapToPair(p -> new Tuple2<Person,SomeType>(p, someFunction.call(p)));
JavaRDD<Person,OtherType> trans2 = trans1.mapToPair(tuple -> new Tuple2<Person,OtherType>(tuple._1(), otherFunction.call(tuple._2())));
you could continue as much as you want, and you always have a reference to the Person object. It can be done in more concise way with .mapToPair
without declaring other RDDs but for me it is more clear like this.
Upvotes: 0
Reputation: 690
I would simply carry a key with me all the way.this way its easier to avoid miss identification as each object comes with its id every time. in other words:
persons
.map(p => (id, p))
.map( (id, p) => (id, transformation1(p)) )
.map( (id, p) => (id, transformation2(p)) )
....
Upvotes: 1
Reputation: 56
I think there is no right or wrong answer to this question. There could be a better answer though.
You are on the right track to first think about making rdd to PairRDD. However as you said that there are many transformations on to the initial RDD structute, it gets complicated quickly.
Sorry about the bad drawing.. Anyway probably for multidependency, it is not very clear what to put at key field of PairRDD.
I am not sure if this is the case for you but I think if the relationship is not one to one, there could be many Persons which produces one Integer.
If you are using reduce operation to the Integer before you interpret the dependency information, you need to concern that an Integer may not have only one ancestor.
Anyway, I think the best way to solve this problem is that you add a unique identifier ArrayList field in the RDD. Instead of making a PairRDD, which adds unnecessary structure, just think about this field as a graph that denotes ancestry of current RDD's field.
For example, Persons object would have a field named "dependency" which is length 0 arraylist because it has no ancestor. After that, let's say that you have a transformation to Double for some reason. Then the resulting RDD contains a field named "dependency" which has length 1 which denotes the unique identifier field of Person object. Lastly we have transformation to Integer. Again we have a RDD with field named "dependency" which is length 2(because we had two ancestors for this one integer) that denotes unique identifier of Person object and unique identifier of Double object.
I think my explanation is bit lengthy and verbose but I hope you get the meaning..
Lastly if you are doing reduce operation between the RDDs, you have to consider if you really have one to one case. Because one Integer may not have come from the one Person object, if you want to discover full lineage of this Integer, you got to add all dependency information to the arraylist. Also when you decipher this "dependency" arraylist, you have to keep in mind that the length could be arbitrary for the list if the relationship is not one to one and if you are using reduce between the RDDs.
The best solution I thought was this one but I think there could be simpler answer to this question. If you find out one let me know!
Upvotes: 0