V.Edulakanti

Reputation: 21

Spark partition pruning doesn't work on 1.6.0

I created partitioned Parquet files on HDFS and created a Hive external table over them. When I query the table with a filter on the partitioning column, Spark scans all the partition directories instead of only the matching partition. We are on Spark 1.6.0.

Creating the DataFrame:

df = hivecontext.createDataFrame([
    ("class1", "Economics", "name1", None),
    ("class2", "Economics", "name2", 92),
    ("class2", "CS", "name2", 92),
    ("class1", "CS", "name1", 92)
], ["class", "subject", "name", "marks"])

Writing the partitioned Parquet files:

hivecontext.setConf("spark.sql.parquet.compression.codec", "snappy")
hivecontext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
df.write.parquet("/transient/testing/students", mode="overwrite", partitionBy="subject")
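The question mentions a Hive external table but doesn't show its DDL. A hedged sketch of what that registration might look like, assuming the column layout and path above (the real DDL may differ; MSCK REPAIR is one way to make the directory partitions visible to the metastore):

```python
# Sketch only: assumes a live HiveContext named `hivecontext` and the
# directory layout written above; the actual table DDL is not shown
# in the question.
hivecontext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS vatmatching_stage.students (
        class STRING,
        name STRING,
        marks BIGINT
    )
    PARTITIONED BY (subject STRING)
    STORED AS PARQUET
    LOCATION '/transient/testing/students'
""")

-- is not valid here; register the partition directories with the metastore:
hivecontext.sql("MSCK REPAIR TABLE vatmatching_stage.students")
```

Without the MSCK REPAIR (or explicit ALTER TABLE ... ADD PARTITION statements), the metastore has no partition entries and neither Hive nor Spark can prune.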

Query:

df = hivecontext.sql('select * from vatmatching_stage.students where subject = "Economics"')
df.show()

+------+-----+-----+---------+
| class| name|marks|  subject|
+------+-----+-----+---------+
|class1|name1|    0|Economics|
|class2|name2|   92|Economics|
+------+-----+-----+---------+

df.explain(True)

    == Parsed Logical Plan ==
    'Project [unresolvedalias(*)]
    +- 'Filter ('subject = Economics)
       +- 'UnresolvedRelation `vatmatching_stage`.`students`, None

    == Analyzed Logical Plan ==
    class: string, name: string, marks: bigint, subject: string
    Project [class#90,name#91,marks#92L,subject#89]
    +- Filter (subject#89 = Economics)
       +- Subquery students
          +- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students

    == Optimized Logical Plan ==
    Project [class#90,name#91,marks#92L,subject#89]
    +- Filter (subject#89 = Economics)
       +- Relation[class#90,name#91,marks#92L,subject#89] ParquetRelation: vatmatching_stage.students

    == Physical Plan ==
    Scan ParquetRelation: vatmatching_stage.students[class#90,name#91,marks#92L,subject#89] InputPaths: hdfs://dev4/transient/testing/students/subject=Art, hdfs://dev4/transient/testing/students/subject=Civil, hdfs://dev4/transient/testing/students/subject=CS, hdfs://dev4/transient/testing/students/subject=Economics, hdfs://dev4/transient/testing/students/subject=Music

But if I run the same query in the Hive browser, I can see that Hive is doing partition pruning:

location                 hdfs://testing/students/subject=Economics
name                     vatmatching_stage.students
numFiles                 1
numRows                  -1
partition_columns        subject
partition_columns.types  string

Is this a limitation in Spark 1.6.0, or am I missing something here?

Upvotes: 1

Views: 663

Answers (1)

V.Edulakanti

Reputation: 21

Found the root cause of this issue. The HiveContext used for querying the table doesn't have "spark.sql.hive.convertMetastoreParquet" set to "false". It was set to "true", the default value.

When I set it to "false", I can see that partition pruning is being used.
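A minimal sketch of the fix, assuming the same table as in the question: the conf must be set on the HiveContext that plans the query, not only on the one that wrote the files.

```python
# Sketch only: assumes a live HiveContext named `hivecontext` and the
# vatmatching_stage.students table from the question.
# Disable conversion to Spark's native ParquetRelation so the Hive
# metastore's partition metadata is consulted at planning time.
hivecontext.setConf("spark.sql.hive.convertMetastoreParquet", "false")

df = hivecontext.sql(
    'select * from vatmatching_stage.students where subject = "Economics"')

# The physical plan should now reference only the
# .../students/subject=Economics input path.
df.explain(True)
</imports>
```

Note the trade-off: with convertMetastoreParquet=false, Spark 1.6 uses the (slower) Hive SerDe path for reading instead of its built-in Parquet reader, so this gains pruning at some per-row read cost.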

Upvotes: 1
