Reputation: 11
Following up on a previously asked question; adding a link.
In short: I wrote a file compactor in Spark. It works by reading all files under a directory into a DataFrame, calling coalesce on the DataFrame (with the desired number of output files), writing the files back into their directory, and compressing them with Snappy.
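The workflow above can be sketched roughly as follows. This is a minimal, hedged sketch, not the asker's actual code: the partition path and target file count are hypothetical, and the output goes to a temporary sibling path so the job does not read and overwrite the same files at once.

```python
def compacted_path(partition_path: str) -> str:
    """Return a temporary output path next to the partition being compacted
    (hypothetical naming convention for this sketch)."""
    return partition_path.rstrip("/") + "_compacted"

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # requires a Spark installation

    spark = SparkSession.builder.appName("compactor").getOrCreate()

    partition_path = "/warehouse/my_table/dt=2020-01-01"  # hypothetical path
    target_files = 4  # desired number of output files

    df = spark.read.parquet(partition_path)
    # Re-select the columns explicitly so the written files keep the original
    # column order; Hive resolves columns positionally for some file formats,
    # so a changed column order can show up as "altered" data.
    (df.select(*df.columns)
       .coalesce(target_files)
       .write
       .option("compression", "snappy")
       .mode("overwrite")
       .parquet(compacted_path(partition_path)))
```

After the write succeeds, the compacted files would replace the originals in a separate move step.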
The problem I have: the directories I'm compacting are actually partitions of a table in Apache Hive. After writing the files back into their directory and running a basic SELECT query over the partition in Hive, the data appears to be altered. For example:
This table:
| Column A | Column B |
|---|---|
| 1 | null |
| null | 1 |
Turns into:
| Column A | Column B |
|---|---|
| 1 | null |
| 1 | null |
Can someone please help me understand why the data is being altered and how I can fix it?
Upvotes: 0
Views: 292
Reputation: 1483
Compaction of a Hive partitioned table can be performed from PySpark using the code below. Make sure you mention the partition columns.
```python
db_name = "your_db_name"
table_name = "your_table_name"
partition_columns = ["partition_col1", "partition_col2"]  # the table's partition columns

# Recover any partitions that were added or rewritten directly on the filesystem.
spark.sql(f"MSCK REPAIR TABLE {db_name}.{table_name}")
# Optionally refresh table-level statistics after the rewrite.
spark.sql(f"ANALYZE TABLE {db_name}.{table_name} COMPUTE STATISTICS")
```
Computing the statistics is completely optional.
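For completeness, the two statements above could be wrapped in a small helper so they are easy to reuse per table. This is a hedged sketch; the database and table names are placeholders, and running the statements requires a Spark session with Hive support:

```python
def repair_and_analyze_sql(db_name: str, table_name: str) -> list:
    """Build the repair and statistics statements as plain SQL strings."""
    return [
        f"MSCK REPAIR TABLE {db_name}.{table_name}",
        f"ANALYZE TABLE {db_name}.{table_name} COMPUTE STATISTICS",
    ]

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # requires Spark built with Hive support

    spark = (SparkSession.builder
             .enableHiveSupport()
             .getOrCreate())
    # Placeholder names; substitute your own database and table.
    for stmt in repair_and_analyze_sql("my_db", "my_table"):
        spark.sql(stmt)
```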
Upvotes: 0
Reputation: 1
It sounds like your problem is in the coalesce call: when it merges the data into fewer files it can rearrange it, which can lead to inconsistencies.
Upvotes: 0