goutham

Reputation: 31

Why does groupByKey not give correct groups when used with a custom class?

I defined a custom class Person for my data and used the groupByKey operation as follows:

public class Person implements Serializable {
    private static final long serialVersionUID = 1L;
    private int personId;
    private String name;
    private String address;
    public Person(int personId, String name, String address) {
        this.personId = personId;
        this.name = name;
        this.address = address;
    }
    public int getPersonId() {  return personId;}
    public void setPersonId(int personId) { this.personId = personId;}
    public String getName() {   return name;}
    public void setName(String name) {  this.name = name;}
    public String getAddress() {    return address;}
    public void setAddress(String address) {    this.address = address;}
}
List<Person> personList = new ArrayList<Person>();
personList.add(new Person(111, "abc", "test1"));
personList.add(new Person(222, "def", "test2"));
personList.add(new Person(333, "fhg", "test3"));
personList.add(new Person(111, "jkl", "test4"));
personList.add(new Person(555, "mno", "test5"));
personList.add(new Person(444, "pqr", "test6"));
personList.add(new Person(111, "xyz", "test7"));

JavaRDD<Person> initialRDD = sc.parallelize(personList, 4);

JavaPairRDD<Person, Iterable<Person>> groupedBy = 
    initialRDD.cartesian(initialRDD).groupByKey();

But printing the keys of the result as follows shows that no grouping by key took place:

groupedBy.foreach(x -> System.out.println(x._1.getPersonId()));

Result is: 222 111 555 444 555 111 222 111 333 222 444 111 111 111 444 111 333 111 111 222 555 111 333 333 444 111 111 555

I expected the result to contain only the unique keys. Is my understanding of the groupByKey function in Spark wrong?

Upvotes: 0

Views: 186

Answers (1)

zero323

Reputation: 330093

groupByKey, like the other byKey operations, depends on a meaningful implementation of hashCode and equals. Since you don't provide your own implementations, Person uses the defaults inherited from Object, which compare by object identity. Two Person instances with the same personId are therefore treated as different keys.

Try for example:

@Override
public int hashCode() {
    return this.personId;
}

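@Override
public boolean equals(Object o) {
    // Compare by personId only, and only against another Person
    // (comparing hash codes alone would fail on null and match unrelated types)
    if (this == o) return true;
    if (!(o instanceof Person)) return false;
    return this.personId == ((Person) o).personId;
}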
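
To see the effect without a Spark cluster, here is a minimal sketch that uses a plain Java HashMap grouping as a stand-in for what groupByKey does with its keys. The classes IdentityPerson, ValuePerson and GroupingDemo are hypothetical names made up for this illustration; they are not part of the question's code or of Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical key class without equals/hashCode: it inherits identity-based ones from Object.
class IdentityPerson {
    final int personId;
    IdentityPerson(int personId) { this.personId = personId; }
}

// Hypothetical key class with value-based equality on personId.
class ValuePerson {
    final int personId;
    ValuePerson(int personId) { this.personId = personId; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ValuePerson)) return false;
        return personId == ((ValuePerson) o).personId;
    }

    @Override
    public int hashCode() { return Integer.hashCode(personId); }
}

class GroupingDemo {
    // Groups the keys in a hash-based map and returns the number of distinct groups.
    static <K> int groupCount(List<K> keys) {
        Map<K, Long> groups = keys.stream()
            .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
        return groups.size();
    }

    // Three distinct instances, two of them sharing personId 111.
    static int identityGroups() {
        return groupCount(Arrays.asList(
            new IdentityPerson(111), new IdentityPerson(222), new IdentityPerson(111)));
    }

    static int valueGroups() {
        return groupCount(Arrays.asList(
            new ValuePerson(111), new ValuePerson(222), new ValuePerson(111)));
    }

    public static void main(String[] args) {
        System.out.println(identityGroups()); // 3: every instance is its own key
        System.out.println(valueGroups());    // 2: both 111 instances fall into one group
    }
}
```

The same reasoning should carry over to Person as a key in a JavaPairRDD: once equals and hashCode are based on personId, instances with the same id end up in the same group.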

Upvotes: 1
