Jessica Smith

Reputation: 151

Does Apache Spark JdbcRDD use HDFS?

Does Apache Spark's JdbcRDD use HDFS to store and distribute database records to the worker nodes? We are using JdbcRDD to interact with a database from Apache Spark, and we are wondering whether Spark stages the database table records in HDFS before distributing them, or whether the worker nodes interact with the database directly.

Upvotes: 3

Views: 193

Answers (1)

mattinbits

Reputation: 10428

JdbcRDD does not use HDFS; it reads the data from the JDBC connection directly into the RDD in the workers' memory. If you wanted the results on HDFS, you'd have to explicitly persist the RDD to HDFS.
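For example, here is a minimal sketch of that pattern. The connection URL, table, column names and output path are placeholders for illustration, not anything from your setup; the point is that each partition pulls rows over JDBC in memory, and HDFS is only involved when you call a save action yourself:

    import java.sql.DriverManager
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    object JdbcRddToHdfs {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("jdbc-rdd-example"))

        // Each partition opens its own JDBC connection on the worker;
        // nothing is staged in HDFS at this point.
        val rdd = new JdbcRDD(
          sc,
          () => DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass"),
          // The query needs two '?' placeholders bound to each partition's id range.
          "SELECT id, name FROM people WHERE id >= ? AND id <= ?",
          lowerBound = 1L,
          upperBound = 100000L,
          numPartitions = 10,
          mapRow = rs => (rs.getLong(1), rs.getString(2))
        )

        // Only this explicit save writes the results to HDFS.
        rdd.map { case (id, name) => s"$id,$name" }
           .saveAsTextFile("hdfs:///output/people")
      }
    }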

You can see how JdbcRDD operates here https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala

RDDs must implement a compute method which returns an iterator for the values of each partition in the RDD. The JdbcRDD implementation just wraps a JDBC result set iterator:

override def getNext(): T = {
  if (rs.next()) {
    mapRow(rs)
  } else {
    finished = true
    null.asInstanceOf[T]
  }
}
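To put that snippet in context, here is a simplified sketch of the shape of JdbcRDD's compute (see the linked source for the real implementation, which adds error handling and cleanup): each partition opens its own connection, binds the partition's bounds into the query, and streams the ResultSet as an iterator, with no HDFS in the path.

    // Simplified sketch, not the exact Spark source.
    override def compute(thePart: Partition, context: TaskContext): Iterator[T] = {
      val part = thePart.asInstanceOf[JdbcPartition]
      val conn = getConnection()
      val stmt = conn.prepareStatement(sql)
      // Bind this partition's slice of the key range into the two '?' placeholders.
      stmt.setLong(1, part.lower)
      stmt.setLong(2, part.upper)
      val rs = stmt.executeQuery()
      new NextIterator[T] {
        override def getNext(): T =
          if (rs.next()) mapRow(rs) else { finished = true; null.asInstanceOf[T] }
        override def close(): Unit = { rs.close(); stmt.close(); conn.close() }
      }
    }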

Upvotes: 2
