Liran Eliyahu

Reputation: 11

Data in Hive table is changed after running a compaction in PySpark

Following up on a previously asked question (adding the link).

In short: I wrote a file compactor in Spark. It works by reading all files under a directory into a DataFrame, coalescing the DataFrame down to the desired number of files, writing the files back into their directory, and compressing them with Snappy.

The problem I have: the directories I'm compacting are actually partitions of a table in Apache Hive. After rewriting the files back into their directory and running a basic SELECT query over the partition in Hive, the data appears to be altered. For example:

This table:

Column A    Column B
1           null
null        1

Turns into:

Column A    Column B
1           null
1           null

Can someone please help me understand why the data is being altered and how I can fix it?

Upvotes: 0

Views: 292

Answers (2)

Indrajit Swain

Reputation: 1483

  1. Compaction of a Hive partitioned table in PySpark can be performed using the code below.

  2. Make sure you're specifying the partition columns.

    db_name = "your_db_name"
    table_name = "your_table_name"
    partition_columns = ["partition_col1", "partition_col2"]

    spark.sql(f"MSCK REPAIR TABLE {db_name}.{table_name}")
    spark.sql(f"ANALYZE TABLE {db_name}.{table_name} COMPUTE STATISTICS")

Computing the statistics is completely optional.

Upvotes: 0

Ariel Grosh

Reputation: 1

Sounds like your problem is in the coalesce step: when it unifies the data it replaces it, and that can lead to inconsistencies.

Upvotes: 0
