cool dude

Reputation: 27

Spark groupBy agg not working as expected

I am running into a similar issue:

import org.apache.spark.sql.functions.last

(df
    .groupBy("email")
    .agg(last("user_id") as "user_id")
    .select("user_id").count,
df
    .groupBy("email")
    .agg(last("user_id") as "user_id")
    .select("user_id")
    .distinct
    .count)

When run on one machine it gives: (15123144,15123144)

When run on a cluster it gives: (15123144,24)

The first result is what I expect and looks correct, but the second one is horribly wrong. One more observation: even if I change the data so that the total count is more or less than 15123144, I still get distinct = 24 on the cluster. Even if I swap user_id and email, it gives the same distinct count.

I am even more confused after reading: https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.sql.DataFrame

The agg doc says: "Aggregates on the entire DataFrame without groups." "Without groups"? What does that mean?

Any clue? Or an existing JIRA ticket? Or is there a workaround for now?

Upvotes: 1

Views: 1656

Answers (1)

zero323

Reputation: 330413

Let's start with the "without groups" part. As described in the docs:

df.agg(...) is a shorthand for df.groupBy().agg(...)

If that is still not clear, it translates to the following SQL:

SELECT SOME_AGGREGATE_FUNCTION(some_column) FROM table
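In DataFrame terms, both of the calls below compute the same global aggregate over the whole table, returning a single row (a minimal sketch, assuming df has a user_id column; count is the aggregate function from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.count

// Equivalent: the aggregate is computed over the entire DataFrame,
// with no grouping key, so the result is a single row.
df.agg(count("user_id"))
df.groupBy().agg(count("user_id"))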

Regarding your second problem, it is hard to give a good answer without access to the data, but generally speaking these two queries are not equivalent. The first simply counts the number of distinct email values; the second counts the unique values of the last user_id taken per email. Moreover, last without an explicit ordering is meaningless, because the "last" row seen in each group depends on how the data happens to be partitioned and scheduled, which is why a single machine and a cluster give different results.
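If you want a deterministic "last" value, you have to define the ordering yourself, for example with a window function. The sketch below assumes a hypothetical ts column that says which row per email counts as the latest; it illustrates the idea rather than being a drop-in fix:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Hypothetical ordering column "ts" defines what "last" means per email.
val w = Window.partitionBy("email").orderBy(col("ts").desc)

val lastUserPerEmail = df
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)   // keep only the latest row for each email
  .select("user_id")

(lastUserPerEmail.count, lastUserPerEmail.distinct.count)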

Upvotes: 1
