Reputation: 434
I'm working through an exercise in a book, but Spark doesn't like the way SQLContext.load is being used. The first step launches pyspark with specific parameters:
pyspark --driver-class-path /usr/share/java/mysql-connector-java-5.1.39-bin.jar --master local
And this goes fine. Next, an import:
from pyspark.sql import SQLContext
sqlctx = SQLContext(sc)
Then comes the contentious part:
>>> employeesdf = sqlctx.load(source="jdbc",
... url="jdbc:mysql://localhost:3306/employees?user=<user>&password=<pwd>",
... dbtable="employees",
... partitionColumn="emp_no",
... numPartitions="2",
... lowerBound="10001",
... upperBound="499999"
... )
Now, I'm supposed to follow this up with employeesdf.rdd.getNumPartitions(), but before I even get there, the statement above ends with the error "AttributeError: 'SQLContext' object has no attribute 'load'".
The book seems to have anticipated this, because it says, "Check the API documentation for the version of Spark you are using, in more recent releases you are encouraged to use the load method from the DataFrameReader object instead of the SQLContext."
So I tried the same example, except substituting "sqlctx" with "DataFrameReader":
>>> employeesdf = DataFrameReader.load(source="jdbc",
... url="jdbc:mysql://localhost:3306/employees?user=<user>password=<pwd>",
... dbtable="employees",
... partitionColumn="emp_no",
... numPartitions="2",
... lowerBound="10001",
... upperBound="499999"
... )
I then get the error: "TypeError: unbound method load() must be called with DataFrameReader instance as first argument (got nothing instead)". So I suspect I'm using DataFrameReader incorrectly, but despite looking through the documentation I can't tell what the proper usage is. Can anyone tell me what I'm doing wrong? Thanks in advance for any assistance.
(Spark version is 2.1.1)
Upvotes: 1
Views: 338
Reputation: 5792
SQLContext is not the preferred way to load data in Spark 2.x; it is kept around only for backwards compatibility. Use spark.read.jdbc, where spark is a SparkSession object. SparkSession is the modern entry point to just about everything that was formerly split between SparkContext and SQLContext; spark.read hands you a properly constructed DataFrameReader instance, which is also why calling DataFrameReader.load directly on the class fails with the unbound-method error you saw. I recommend Jacek's GitBook on mastering Spark for a phenomenal guide to the current (2.x) Spark APIs and really all about Spark in general.
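For example, here is a minimal sketch using the same connection details as the question (untested against your database; <user> and <pwd> are placeholders as above, and spark is the SparkSession the pyspark shell creates for you):
>>> employeesdf = spark.read.jdbc(
...     url="jdbc:mysql://localhost:3306/employees?user=<user>&password=<pwd>",
...     table="employees",
...     column="emp_no",        # partition column
...     lowerBound=10001,
...     upperBound=499999,
...     numPartitions=2)
>>> employeesdf.rdd.getNumPartitions()  # should report 2, matching numPartitions
If you prefer the generic load path your book mentions, the equivalent is spark.read.format("jdbc") with the same settings supplied as string options (dbtable, partitionColumn, lowerBound, upperBound, numPartitions) followed by .load().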
Upvotes: 1