Ida
Ida

Reputation: 2999

Difference between Spark RDD's take(1) and first()

I used to think that rdd.take(1) and rdd.first() are exactly the same. However I began to wonder if this is really true after my colleague pointed me to Spark's officiation documentation on RDD:

first(): Return the first element in this RDD.

take(num): Take the first num elements of the RDD. It works by first scanning one partition, and use the results from that partition to estimate the number of additional partitions needed to satisfy the limit.

My questions are:

  1. Is the underlying implementation of first() the same as take(1)?
  2. Suppose rdd1 and rdd2 are constructed from the same csv, can I safely assume that rdd1.take(1) and rdd2.first() will always return the same result, i.e., the first row of the csv? What if rdd1 and rdd2 are partitioned differently?

Upvotes: 15

Views: 44747

Answers (3)

Amit Kumar
Amit Kumar

Reputation: 2745

No both are not same.

rdd.first() will Return the first element in this RDD while rdd.take(1) will return an array that will have first element only.

  1. Is the underlying implementation of first() the same as take(1)?

Ans : In terms of implementation first() calls take(1) internally and returns first and only element of the array returned by take(1). Taken from org.apache.spark.rdd.RDD class

  /**
   * Return the first element in this RDD.
   */
  def first(): T = withScope {
    take(1) match {
      case Array(t) => t
      case _ => throw new UnsupportedOperationException("empty collection")
    }
  }
  1. Suppose rdd1 and rdd2 are constructed from the same csv, can I safely assume that rdd1.take(1) and rdd2.first() will always return the same result, i.e., the first row of the csv? What if rdd1 and rdd2 are partitioned differently?

Ans : Yes you can assume, partitioning do not change the order in which input was read.

Upvotes: 6

Gaurav Sehrawat
Gaurav Sehrawat

Reputation: 1

So, its seems that both are same, but we do have differences.

1.When we read data from the file, it is by default a RDD, and a RDD have both first() and take() attributes.
2.The first() attribute returns a row type object while take() returns a list type.

But As soon as we converts our RDD to DataFrame using .toDF(), we don't have the first() attribute on that DF.

Hope it may clear the concepts further.

See the image for more clearity

Upvotes: 0

Pranav Shukla
Pranav Shukla

Reputation: 2226

Infact first is implemented in terms of take.

Following is taken from spark's source of RDD.scala. first calls take(1) and returns the first element if found.

  def first(): T = withScope {
    take(1) match {
      case Array(t) => t
      case _ => throw new UnsupportedOperationException("empty collection")
    }
  }

take(num) tries to take num elements from starting from RDD's 0th partition (if you consider 0 based indexes). So the behavior of take(1) and first will be identical.

Even the spark programming guide confirms this.

About your second question: it depends what you mean when you say partitioned differently. If you are calling sc.textFile("/path/to/file") with or without numPartitions, it wouldn't matter because 0th partition will always be 0th partition. So Yes, you can assume that they will have the same first element.

EDIT: Partitions in RDD are ordered, the physical first line in your CSV will end up in the 0th partition on RDD. And take(1) and first both will return that first row of 0th partition.

Upvotes: 21

Related Questions