gourxb

Reputation: 325

What is the fastest way to read a database using PySpark?

I am trying to read a table from a database using PySpark and SQLAlchemy as follows:

import os
import time

from pyspark import SparkContext
from pyspark.sql import SQLContext
from sqlalchemy import create_engine, text

SUBMIT_ARGS = "--jars mysql-connector-java-5.1.45-bin.jar pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
sc = SparkContext('local[*]', 'testSparkContext')
sqlContext = SQLContext(sc)

# Read the table through Spark's JDBC data source and collect it to the driver.
t0 = time.time()
database_uri = 'jdbc:mysql://{}:3306/{}'.format("127.0.0.1", <db_name>)
dataframe_mysql = sqlContext.read.format("jdbc").options(
    url=database_uri,
    driver="com.mysql.jdbc.Driver",
    dbtable=<table_name>,
    user=<user>,
    password=<password>).load()
print(dataframe_mysql.rdd.map(lambda row: list(row)).collect())

# Read the same table with a plain SQLAlchemy connection.
t1 = time.time()
database_uri2 = 'mysql://{}:{}@{}/{}'.format(<user>, <password>, "127.0.0.1", <db_name>)
engine = create_engine(database_uri2)
connection = engine.connect()
s = text("select * from {}.{}".format(<db_name>, <table_name>))
result = connection.execute(s)
for each in result:
    print(each)
t2 = time.time()

print("Time taken by PySpark:", (t1 - t0))
print("Time taken by SQLAlchemy:", (t2 - t1))

This is the time taken to fetch some 3100 rows:

Time taken by PySpark: 12.326745986938477
Time taken by SQLAlchemy: 0.21664714813232422

Why is SQLAlchemy outperforming PySpark? Is there any way to make this faster? Is there an error in my approach?

Upvotes: 3

Views: 3771

Answers (1)

user9579544

Reputation: 61

Why is SQLAlchemy outperforming PySpark? Is there any way to make this faster? Is there an error in my approach?

More than one. Ultimately you are trying to use Spark in a way it is not intended to be used, you are measuring the wrong thing, and you introduce an incredible amount of indirection. Overall:

  • The JDBC data source is inefficient, and the way you use it is completely sequential. Check parallelizing reads in Spark Gotchas (see the sketch after this list).
  • Collecting data to the driver is not intended for production use.
  • You introduce a lot of indirection by converting the data to an RDD, serializing it, fetching it to the driver, and deserializing it.
  • Your code measures not only the data processing time but also the cluster / context initialization time.
  • Local mode (designed for prototyping and unit testing) is just the cherry on top.
  • And so on...
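
For reference, a minimal sketch of what a partitioned JDBC read can look like; the partitioning column, bounds, and partition count below are hypothetical placeholders, not taken from the question:

from pyspark.sql import SparkSession

# Build the session once and keep its startup time out of the measurement.
spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

# Partitioned JDBC read: Spark opens numPartitions parallel connections,
# each fetching one slice of the partitionColumn range.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://127.0.0.1:3306/<db_name>")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "<table_name>")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("partitionColumn", "id")   # hypothetical numeric column to split on
      .option("lowerBound", "1")         # minimum value of that column
      .option("upperBound", "3100")      # maximum value of that column
      .option("numPartitions", "4")      # number of parallel JDBC connections
      .load())

# Keep the work on the executors instead of collecting every row to the driver,
# e.g. aggregate or write the result out.
print(df.count())

Even then, for a table of roughly 3100 rows Spark's startup and scheduling overhead dominates; the comparison only becomes meaningful at data volumes where the JDBC fetch itself is the bottleneck.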

So at the end of the day your code is slow, but it is not something you'd use in a production application anyway. SQLAlchemy and Spark are designed for completely different purposes - if you're looking for a low-latency database access layer, Spark is not the right choice.

Upvotes: 6
