Reputation: 13
I noticed a huge difference in performance when I run comparable queries through HiveContext versus SQLContext within PySpark:
from pyspark.sql import HiveContext

# Query the Hive table through the metastore and bring the result to the driver
sqlContext = HiveContext(sc)
qry = "select count(*) from database.table a where a.field1 = 'ABCD'"
results = sqlContext.sql(qry).collect()
from pyspark.sql import SQLContext

# Read the partition's parquet files directly, bypassing the Hive metastore
sqlContext = SQLContext(sc)
df = sqlContext.parquetFile('hdfs_path_to_hive_table/field1=ABCD/')
df.select("field2").show()
Any help would be greatly appreciated!
I have tried:
SET spark.sql.hive.metastorePartitionPruning=true
and I was still running into the same issue. I let the process run a while longer to see how high the CPU usage would climb, and it reached upwards of 2000%!
I heard that parquet-formatted files were potentially an issue with Spark prior to version 1.5, so I did all of my testing with these additional settings in Spark 1.5.1:
parquet.task.side.metadata=false
SET spark.sql.parquet.filterPushdown=true
SET spark.sql.parquet.cacheMetadata=false
but none of them seemed to help either.
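For completeness, here is roughly how these settings can be applied inside a PySpark session (a sketch; the app name is arbitrary, and parquet.task.side.metadata is a Hadoop-level Parquet property, so it goes on the Hadoop configuration rather than through a SQL SET):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="pruning-test")  # arbitrary app name
sqlContext = HiveContext(sc)

# SQL-level settings can be applied as SET statements
sqlContext.sql("SET spark.sql.hive.metastorePartitionPruning=true")
sqlContext.sql("SET spark.sql.parquet.filterPushdown=true")
sqlContext.sql("SET spark.sql.parquet.cacheMetadata=false")

# Hadoop-level property, set on the underlying Hadoop configuration
# (note: _jsc is a private attribute of SparkContext)
sc._jsc.hadoopConfiguration().set("parquet.task.side.metadata", "false")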
In my quest for an answer I came across various links, which led me to try the configurations above.
Upvotes: 1
Views: 7139
Reputation: 2611
collect() ---> pulls the entire result set back to the driver
show() ---> displays only a sample, the first 20 rows by default
When the data is huge, you will obviously see a pretty big difference in time and memory.
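A quick sketch of the difference, reusing the parquet path from the question (the app name is arbitrary):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="collect-vs-show")  # arbitrary app name
sqlContext = SQLContext(sc)
df = sqlContext.parquetFile('hdfs_path_to_hive_table/field1=ABCD/')

df.select("field2").show()            # fetches and prints only the first 20 rows
df.select("field2").show(50)          # the sample size is configurable
rows = df.select("field2").collect()  # ships every matching row to the driver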
Upvotes: 0
Reputation: 11
.collect() and .show() are very different
Perhaps the performance difference you see is due to the difference between collect (which pulls the entire resulting dataframe into the driver) and show (which by default only shows the first 20 rows of the resulting dataframe).
It seems that you're not doing any shuffle operations in the lineage, so show() may be pulling in only 20 rows, instead of the whole dataset as in the .collect() case.
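If you only need a small sample on the driver, one option is to bound the transfer with limit before collecting (a sketch, assuming df is the DataFrame from the question):

# limit(20) truncates the query before collect(), so only ~20 rows
# cross the network instead of the entire result set
sample = df.select("field2").limit(20).collect()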
Upvotes: 1
Reputation: 1565
This is likely not a difference between HiveContext/SQLContext, but instead a difference between a table where the metadata is coming from the HiveMetastore vs the SparkSQL Data Source API. I would guess that if you create the table the same way, the performance would be similar.
In the data source API we have spent a fair amount of time optimizing the discovery and handling of many partitions, and in general I would say this path is easier to use / faster.
Likely the problem with the Hive table is downloading all of the partition metadata from the metastore and converting it to our internal format. We do this for all partitions, even though in this case you only want the first ~20 rows.
To improve performance in this case I'd try running:
SET spark.sql.hive.metastorePartitionPruning=true
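From PySpark, this can be issued through the HiveContext before rerunning the query (a sketch; the table and filter are taken from the question, the app name is arbitrary):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="metastore-pruning")  # arbitrary app name
sqlContext = HiveContext(sc)

# With pruning enabled, only metadata for partitions matching the filter
# should be fetched from the metastore
sqlContext.sql("SET spark.sql.hive.metastorePartitionPruning=true")
sqlContext.sql("select count(*) from database.table a where a.field1 = 'ABCD'").show()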
Upvotes: 1