Reputation: 151
Does Apache Spark JdbcRDD use HDFS to store and distribute the database records to worker nodes? We are using JdbcRDD to interact with a database from Apache Spark. Does Spark use HDFS to store and distribute the table records, or do the worker nodes interact with the database directly?
Upvotes: 3
Views: 193
Reputation: 10428
JdbcRDD does not use HDFS. Each worker reads its share of the data over a JDBC connection directly into the RDD partition held in that worker's memory. If you want the results on HDFS, you have to write the RDD there explicitly (for example with saveAsTextFile).
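As a minimal sketch of that flow, the snippet below builds a JdbcRDD and then writes it to HDFS as a separate, explicit step. The JDBC URL, credentials, table, column names, ID bounds, and HDFS path are all hypothetical placeholders, and the job is assumed to be launched with spark-submit (so no master is set in code) with the JDBC driver on the executor classpath.

```scala
import java.sql.{DriverManager, ResultSet}

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object JdbcToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("jdbc-to-hdfs-sketch"))

    // Hypothetical connection details; replace with your own.
    val url = "jdbc:postgresql://db-host:5432/shop"

    val rows = new JdbcRDD[(Long, String)](
      sc,
      // Called on the executors: each partition opens its own connection.
      () => DriverManager.getConnection(url, "user", "password"),
      // The query must contain two ? placeholders for the partition bounds.
      "SELECT id, name FROM customers WHERE id >= ? AND id <= ?",
      1L,       // lowerBound (assumed smallest id)
      1000000L, // upperBound (assumed largest id)
      4,        // numPartitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
    )

    // Nothing touches HDFS until this explicit write.
    rows
      .map { case (id, name) => s"$id\t$name" }
      .saveAsTextFile("hdfs://namenode:8020/tmp/customers")

    sc.stop()
  }
}
```

Each of the four partitions runs the query with its own sub-range of the ID bounds, so the database is queried by the workers rather than by the driver.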
You can see how JdbcRDD operates in its source: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala
RDDs must implement a compute method, which returns an iterator over the values of each partition in the RDD. The JdbcRDD implementation just wraps a JDBC ResultSet in an iterator, whose getNext method is:
    override def getNext(): T = {
      if (rs.next()) {
        mapRow(rs)
      } else {
        finished = true
        null.asInstanceOf[T]
      }
    }
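To make the wrapping pattern concrete without a database, here is a self-contained sketch in plain Scala. SimpleNextIterator is a simplified stand-in for the internal iterator base class Spark uses, and FakeCursor is a hypothetical substitute for java.sql.ResultSet; neither name comes from Spark itself.

```scala
object JdbcIteratorSketch {

  // Simplified stand-in for Spark's internal iterator base class:
  // subclasses implement getNext() and set `finished` when exhausted.
  abstract class SimpleNextIterator[T] extends Iterator[T] {
    private var gotNext = false
    private var nextValue: T = null.asInstanceOf[T]
    protected var finished = false

    protected def getNext(): T

    override def hasNext: Boolean = {
      if (!finished && !gotNext) {
        nextValue = getNext() // may set `finished = true`
        gotNext = true
      }
      !finished
    }

    override def next(): T = {
      if (!hasNext) throw new NoSuchElementException("End of stream")
      gotNext = false
      nextValue
    }
  }

  // Hypothetical substitute for java.sql.ResultSet: next() advances the
  // cursor and reports whether a row is available.
  final class FakeCursor(rows: Seq[String]) {
    private var pos = -1
    def next(): Boolean = { pos += 1; pos < rows.length }
    def current: String = rows(pos)
  }

  // Same shape as the getNext above: pull one row, map it, or mark the end.
  def rowIterator[T](rs: FakeCursor)(mapRow: FakeCursor => T): Iterator[T] =
    new SimpleNextIterator[T] {
      override protected def getNext(): T =
        if (rs.next()) mapRow(rs)
        else {
          finished = true
          null.asInstanceOf[T]
        }
    }

  def main(args: Array[String]): Unit = {
    val it = rowIterator(new FakeCursor(Seq("a", "b")))(_.current.toUpperCase)
    println(it.toList) // prints List(A, B)
  }
}
```

Rows are pulled lazily, one per call to getNext, so a partition's data streams straight from the cursor into whatever consumes the iterator, with no intermediate storage layer involved.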
Upvotes: 2