charles.drotar

Reputation: 13

PySpark - Hive Context Does Not Return Results but SQL Context Does for Similar Query

I noticed a huge difference in performance when I run comparable queries through HiveContext versus SQLContext within PySpark.

VERSIONS/CONFIGURATION

TABLE INFORMATION

HIVECONTEXT IMPLEMENTATION

from pyspark.sql import HiveContext

# HiveContext resolves the table through the Hive metastore
sqlContext = HiveContext(sc)
qry = "select count(*) from database.table a where a.field1 = 'ABCD'"
results = sqlContext.sql(qry).collect()

SQLCONTEXT IMPLEMENTATION

from pyspark.sql import SQLContext

# SQLContext reads the partition's parquet files directly from HDFS,
# bypassing the Hive metastore entirely
sqlContext = SQLContext(sc)
df = sqlContext.parquetFile('hdfs_path_to_hive_table/field1=ABCD/')
df.select("field2").show()

QUESTIONS

Any help would be greatly appreciated!

UPDATE 10/16/2015

I have tried:

SET spark.sql.hive.metastorePartitionPruning=true

and I was still running into the same issue. I let the process run a while longer to see how high the CPU usage would climb, and it reached upwards of 2000%!

I had heard that parquet-formatted files could be an issue with Spark prior to version 1.5, so I did all of my testing with these additional settings in Spark 1.5.1:

SET parquet.task.side.metadata=false
SET spark.sql.parquet.filterPushdown=true
SET spark.sql.parquet.cacheMetadata=false

but none of them seemed to help either.
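For completeness, this is roughly how I was applying these settings from PySpark (sqlContext being the HiveContext from above):

# issued through the HiveContext before running the query
sqlContext.sql("SET spark.sql.hive.metastorePartitionPruning=true")
sqlContext.sql("SET parquet.task.side.metadata=false")
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
sqlContext.sql("SET spark.sql.parquet.cacheMetadata=false")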

In my quest for an answer I came across various links, which led me to try the above configurations.

Upvotes: 1

Views: 7139

Answers (3)

Jack Daniel

Reputation: 2611

collect() ---> brings all of the result data back to the driver (edge node)
show()    ---> displays only a small sample, the first 20 rows by default

When the data is huge, you will obviously see a big difference in both time and memory.
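A rough illustration (df here stands for any DataFrame):

# collect() pulls every row of the result back to the driver,
# so time and memory scale with the full result size
all_rows = df.collect()

# show() computes and prints only the first 20 rows by default
df.show()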

Upvotes: 0

gy8

Reputation: 11

.collect() and .show() are very different

Perhaps the performance difference you see is due to the difference between collect (which pulls the entire resulting dataframe into the driver) and show (which by default only shows the first 20 rows of the resulting dataframe).

It seems that you're not doing any shuffling operations in the lineage, so it could be that show is pulling in only 20 rows (instead of the whole dataset, as in the .collect() case).
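If you want a fairer comparison (a sketch, reusing the table and filter from your question), cap the Hive path at the same 20 rows before collecting:

# collect only the first 20 matching rows through the HiveContext,
# mirroring what show() does on the SQLContext path
qry = "select * from database.table a where a.field1 = 'ABCD'"
rows = sqlContext.sql(qry).limit(20).collect()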

Upvotes: 1

Michael Armbrust

Reputation: 1565

This is likely not a difference between HiveContext/SQLContext, but instead a difference between a table where the metadata is coming from the HiveMetastore vs the SparkSQL Data Source API. I would guess that if you create the table the same way, the performance would be similar.

In the data source API we have spent a fair amount of time optimizing the discovery and handling of many partitions, and in general I would say this path is easier to use / faster.
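For example, here is a sketch of reading the same data through the data source API directly (using the placeholder path from the question):

# partition discovery is handled by Spark SQL itself here,
# so no per-partition metadata is fetched from the Hive metastore
df = sqlContext.read.parquet('hdfs_path_to_hive_table')
df.filter(df.field1 == 'ABCD').count()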

Likely the problem with the Hive table is downloading all of the partition metadata from the metastore and converting it to our internal format. We do this for all partitions, even though in this case you only want the first ~20 rows.

To improve performance in this case I'd try running:

SET spark.sql.hive.metastorePartitionPruning=true

Upvotes: 1
