sfbay

Reputation: 23

SQL dataframe first and last not returning "real" first and last values

I tried using the Apache Spark SQL DataFrame aggregate functions "first" and "last" on a large file, with a Spark master and 2 workers. When I do the "first" and "last" operations I am expecting back the last column from the file, but it looks like Spark is returning the "first" or "last" from the worker partitions.

Is there any way to get the "real" first and last values in aggregate functions?

Thanks,

Upvotes: 1

Views: 1308

Answers (1)

Shagun Sodhani

Reputation: 3727

Yes, it is possible, depending on what you mean by "real" first and last values. For example, if you are dealing with timestamped data and the "real" first value is the oldest record, just orderBy the data on the timestamp and take the first value.

When you say you are "expecting back the last column from the file", I understand that you are in fact referring to the first/last row of data in the file. Please correct me if I have misunderstood.

Thanks.

Edit :

You can read the file into a single partition (by setting numPartitions = 1), then zipWithIndex, and finally parallelize the resulting collection. This gives you a column to order on without changing the source file.
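To show why the index column helps, here is a small plain-Python model of the idea. It is not Spark code: the helper names and the `row-N` sample lines are hypothetical, and the shuffle only imitates rows landing on workers in arbitrary order.

```python
import random


def index_lines(lines):
    """Attach a 0-based position to each line while reading sequentially
    (the role zipWithIndex plays)."""
    return [(idx, line) for idx, line in enumerate(lines)]


def split_into_partitions(records, num_partitions, seed=0):
    """Scatter records across partitions in arbitrary order."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_partitions] for i in range(num_partitions)]


def true_first_and_last(partitions):
    """Recover the file's first and last lines by ordering on the index."""
    all_records = [rec for part in partitions for rec in part]
    first = min(all_records, key=lambda rec: rec[0])
    last = max(all_records, key=lambda rec: rec[0])
    return first[1], last[1]


lines = ["row-%d" % i for i in range(10)]
partitions = split_into_partitions(index_lines(lines), num_partitions=3)
first_line, last_line = true_first_and_last(partitions)
print(first_line, last_line)
```

Once every row carries its original position, the "real" first and last rows are simply the ones with the smallest and largest index, however the data is partitioned afterwards.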

Upvotes: 1
