Reputation: 31
I defined a custom class Person for my data and used the groupByKey operation as follows:
public class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    private int personId;
    private String name;
    private String address;

    public Person(int personId, String name, String address) {
        this.personId = personId;
        this.name = name;
        this.address = address;
    }

    public int getPersonId() { return personId; }
    public void setPersonId(int personId) { this.personId = personId; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getAddress() { return address; }
    public void setAddress(String address) { this.address = address; }
}
List<Person> personList = new ArrayList<Person>();
personList.add(new Person(111, "abc", "test1"));
personList.add(new Person(222, "def", "test2"));
personList.add(new Person(333, "fhg", "test3"));
personList.add(new Person(111, "jkl", "test4"));
personList.add(new Person(555, "mno", "test5"));
personList.add(new Person(444, "pqr", "test6"));
personList.add(new Person(111, "xyz", "test7"));
JavaRDD<Person> initialRDD = sc.parallelize(personList, 4);
JavaPairRDD<Person, Iterable<Person>> groupedBy =
initialRDD.cartesian(initialRDD).groupByKey();
But when I print the keys as follows, the result does not appear to be grouped by key at all:
groupedBy.foreach(x -> System.out.println(x._1.getPersonId()));
Result is: 222 111 555 444 555 111 222 111 333 222 444 111 111 111 444 111 333 111 111 222 555 111 333 333 444 111 111 555
I expected the result to contain only the unique keys. Is my understanding of the groupByKey function in Spark wrong?
Upvotes: 0
Views: 186
Reputation: 330093
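groupByKey, like the other byKey operations, depends on meaningful implementations of hashCode and equals. Since you don't provide your own, Person falls back to the defaults inherited from Object, which compare object identity and are therefore useless in this scenario: two Person instances with the same personId are still treated as different keys.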
Try for example:
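@Override
public int hashCode() {
    // Equal personIds must produce equal hash codes.
    return Integer.hashCode(personId);
}

@Override
public boolean equals(Object o) {
    // Compare by personId; reject null and unrelated types
    // instead of comparing hash codes.
    if (this == o) return true;
    if (!(o instanceof Person)) return false;
    return personId == ((Person) o).personId;
}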
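As far as I know, the grouping step relies on the same equals/hashCode contract as a plain HashMap, so you can see the effect without Spark. The following is only a minimal sketch: GroupingSketch, IdentityKey and ValueKey are hypothetical names made up for illustration, and the ids are the ones from your list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of Spark or of the question's code.
public class GroupingSketch {

    // Key that keeps Object's identity-based equals/hashCode,
    // like Person in the question.
    static class IdentityKey {
        final int personId;
        IdentityKey(int personId) { this.personId = personId; }
    }

    // Key that compares by personId only.
    static class ValueKey {
        final int personId;
        ValueKey(int personId) { this.personId = personId; }

        @Override
        public int hashCode() { return Integer.hashCode(personId); }

        @Override
        public boolean equals(Object o) {
            return o instanceof ValueKey && ((ValueKey) o).personId == personId;
        }
    }

    // Same ids as in the question's personList.
    static final int[] IDS = {111, 222, 333, 111, 555, 444, 111};

    // Every new IdentityKey is a distinct map key, so nothing is merged.
    static int groupsWithIdentityEquality() {
        Map<IdentityKey, List<Integer>> groups = new HashMap<>();
        for (int id : IDS) {
            groups.computeIfAbsent(new IdentityKey(id), k -> new ArrayList<>()).add(id);
        }
        return groups.size();
    }

    // Keys with the same personId collapse into one group.
    static int groupsWithValueEquality() {
        Map<ValueKey, List<Integer>> groups = new HashMap<>();
        for (int id : IDS) {
            groups.computeIfAbsent(new ValueKey(id), k -> new ArrayList<>()).add(id);
        }
        return groups.size();
    }

    public static void main(String[] args) {
        System.out.println(groupsWithIdentityEquality()); // 7: one group per object
        System.out.println(groupsWithValueEquality());    // 5: one group per distinct id
    }
}
```

With identity-based keys every object ends up in its own group, which matches the repeated ids in your output; with value-based keys the three entries for 111 share one group.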
Upvotes: 1