Reputation: 23
I tried using the Apache Spark SQL DataFrame aggregate functions first and last on a large file, with a Spark master and 2 workers. When I run the first and last operations, I expect to get back the first and last values of a column from the file; but it looks like Spark is returning the "first" or "last" from its worker partitions instead.
Is there any way to get the "real" first and last values in aggregate functions?
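To illustrate, a minimal sketch of the behavior I'm seeing (the column name "value" and the local cluster setup are just stand-ins for my real file and deployment):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{first, last}

val spark = SparkSession.builder.appName("FirstLastDemo").master("local[2]").getOrCreate()
import spark.implicits._

// Stand-in for the large file: repartitioning spreads rows across
// workers, so row order no longer matches the source order.
val df = Seq("a", "b", "c", "d").toDF("value").repartition(2)

// These aggregates pick the first/last row each executor happens to see,
// not the first/last row of the original file.
df.agg(first("value"), last("value")).show()
```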
Thanks,
Upvotes: 1
Views: 1308
Reputation: 3727
Yes. It is possible, depending on what you mean by the "real" first and last values. For example, if you are dealing with timestamped data and the "real" first value refers to the oldest record, just orderBy the data according to time and take the first value.
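A minimal sketch of that approach, assuming timestamped data (the column names "eventTime" and "value" are placeholders):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{asc, desc}

val spark = SparkSession.builder.appName("OrderedFirstLast").master("local[2]").getOrCreate()
import spark.implicits._

// Hypothetical timestamped data, deliberately out of order.
val df = Seq(
  ("2017-01-02 10:00:00", "b"),
  ("2017-01-01 10:00:00", "a"),
  ("2017-01-03 10:00:00", "c")
).toDF("eventTime", "value")

// Sorting first makes the result deterministic: the oldest record is
// simply the head of the time-ordered DataFrame, and the newest is
// the head of the reverse-ordered one.
val oldest = df.orderBy(asc("eventTime")).head()
val newest = df.orderBy(desc("eventTime")).head()
```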
When you say "When I run the first and last operations, I expect to get back the first and last values of a column from the file", I understand that you are in fact referring to the first/last row of data in the file. Please correct me if I have misunderstood.
Thanks.
Edit:
You can read the file into a single partition (by setting numPartitions = 1), then zipWithIndex, and finally parallelize the resulting collection. This way you get a column to order on, and you don't change the source file either.
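A sketch of those steps, assuming a plain text file at the hypothetical path "data.txt":

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("IndexedRead").master("local[2]").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._

// Reading with minPartitions = 1 keeps the whole file in one partition,
// so the RDD's order matches the file's line order.
val lines = sc.textFile("data.txt", 1)

// Attach a stable index to every line while the order is still intact...
val indexed = lines.zipWithIndex().collect()

// ...then redistribute across the cluster; the index column survives.
val df = sc.parallelize(indexed).toDF("line", "idx")

// The file's true first and last rows are now an orderBy away.
val firstLine = df.orderBy($"idx".asc).head()
val lastLine  = df.orderBy($"idx".desc).head()
```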
Upvotes: 1