gourxb

Reputation: 325

What is the fastest way to read a database using PySpark?

I am trying to read a table from a database using PySpark and SQLAlchemy as follows:

import os
import time

from pyspark import SparkContext
from pyspark.sql import SQLContext
from sqlalchemy import create_engine, text

SUBMIT_ARGS = "--jars mysql-connector-java-5.1.45-bin.jar pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
sc = SparkContext('local[*]', 'testSparkContext')
sqlContext = SQLContext(sc)

# Read the table through Spark's JDBC data source and collect it to the driver.
t0 = time.time()
database_uri = 'jdbc:mysql://{}:3306/{}'.format("127.0.0.1", <db_name>)
dataframe_mysql = sqlContext.read.format("jdbc").options(
    url=database_uri,
    driver="com.mysql.jdbc.Driver",
    dbtable=<table_name>,
    user=<user>,
    password=<password>).load()
print(dataframe_mysql.rdd.map(lambda row: list(row)).collect())

# Read the same table with a plain SQLAlchemy connection.
t1 = time.time()
database_uri2 = 'mysql://{}:{}@{}/{}'.format(<user>, <password>, "127.0.0.1", <db_name>)
engine = create_engine(database_uri2)
connection = engine.connect()
s = text("select * from {}.{}".format(<db_name>, <table_name>))
result = connection.execute(s)
for each in result:
    print(each)
t2 = time.time()

print("Time taken by PySpark:", (t1 - t0))
print("Time taken by SQLAlchemy:", (t2 - t1))

This is the time taken to fetch some 3100 rows:

Time taken by PySpark: 12.326745986938477
Time taken by SQLAlchemy: 0.21664714813232422

Why is SQLAlchemy outperforming PySpark? Is there any way to make this faster? Is there an error in my approach?

Upvotes: 3

Views: 3771

Answers (1)

user9579544

Reputation: 61

Why is SQLAlchemy outperforming PySpark? Is there any way to make this faster? Is there an error in my approach?

More than one. Ultimately you are trying to use Spark in a way it is not intended to be used, you are measuring the wrong thing, and you introduce an incredible amount of indirection. Overall:

  • The JDBC data source is inefficient, and the way you use it is completely sequential. Check parallelizing reads in Spark Gotchas (see the sketch after this list).
  • Collecting data to the driver is not intended for production use.
  • You introduce a lot of indirection by converting the data to an RDD, serializing it, fetching it to the driver, and deserializing it.
  • Your code measures not only the data processing time but also the cluster / context initialization time.
  • Local mode (designed for prototyping and unit testing) is just the cherry on top.
  • And so on...
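
For reference, a minimal sketch of what a partitioned JDBC read can look like; the partitioning column, bounds, and partition count below are hypothetical placeholders, not taken from the question:

from pyspark.sql import SparkSession

# Build the session once and keep its startup time out of the measurement.
spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

# Partitioned JDBC read: Spark opens numPartitions parallel connections,
# each fetching one slice of the partitionColumn range.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://127.0.0.1:3306/<db_name>")
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "<table_name>")
      .option("user", "<user>")
      .option("password", "<password>")
      .option("partitionColumn", "id")   # hypothetical numeric column to split on
      .option("lowerBound", "1")         # minimum value of that column
      .option("upperBound", "3100")      # maximum value of that column
      .option("numPartitions", "4")      # number of parallel JDBC connections
      .load())

# Keep the work on the executors instead of collecting every row to the driver,
# e.g. aggregate or write the result out.
print(df.count())

Even then, for a table of roughly 3100 rows Spark's startup and scheduling overhead dominates; the comparison only becomes meaningful at data volumes where the JDBC fetch itself is the bottleneck.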

So at the end of the day your code is slow, but it is not something you'd use in a production application anyway. SQLAlchemy and Spark are designed for completely different purposes - if you're looking for a low-latency database access layer, Spark is not the right choice.

Upvotes: 6
