Reputation: 11
Following up on a previously asked question; adding a link.
In short: I wrote a file compactor in Spark. It works by reading all files under a directory into a DataFrame, calling coalesce on the DataFrame (with the desired number of output files), writing the files back into their directory, and compressing them with Snappy.
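The workflow above can be sketched roughly as follows. This is a minimal, hedged sketch, not the asker's actual code: the partition path and target file count are hypothetical, and the output goes to a temporary sibling path so the job does not read and overwrite the same files at once.

```python
def compacted_path(partition_path: str) -> str:
    """Return a temporary output path next to the partition being compacted
    (hypothetical naming convention for this sketch)."""
    return partition_path.rstrip("/") + "_compacted"

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # requires a Spark installation

    spark = SparkSession.builder.appName("compactor").getOrCreate()

    partition_path = "/warehouse/my_table/dt=2020-01-01"  # hypothetical path
    target_files = 4  # desired number of output files

    df = spark.read.parquet(partition_path)
    # Re-select the columns explicitly so the written files keep the original
    # column order; Hive resolves columns positionally for some file formats,
    # so a changed column order can show up as "altered" data.
    (df.select(*df.columns)
       .coalesce(target_files)
       .write
       .option("compression", "snappy")
       .mode("overwrite")
       .parquet(compacted_path(partition_path)))
```

After the write succeeds, the compacted files would replace the originals in a separate move step.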
The problem I have: the directories I'm compacting are actually partitions of a table in Apache Hive. After writing the files back into their directory and running a basic SELECT query over the partition in Hive, the data appears to be altered. For example:
This table:
| Column A | Column B |
|---|---|
| 1 | null |
| null | 1 |
Turns into:
| Column A | Column B |
|---|---|
| 1 | null |
| 1 | null |
Can someone please help me understand why the data is being altered and how I can fix it?
Upvotes: 0
Views: 292
Reputation: 1483
Compaction of a Hive partitioned table can be performed from PySpark using the code below. Make sure you mention the partition columns.
```python
db_name = "your_db_name"
table_name = "your_table_name"
partition_columns = ["partition_col1", "partition_col2"]  # the table's partition columns

# Recover any partitions that were added or rewritten directly on the filesystem.
spark.sql(f"MSCK REPAIR TABLE {db_name}.{table_name}")
# Optionally refresh table-level statistics after the rewrite.
spark.sql(f"ANALYZE TABLE {db_name}.{table_name} COMPUTE STATISTICS")
```
Computing the statistics is completely optional.
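For completeness, the two statements above could be wrapped in a small helper so they are easy to reuse per table. This is a hedged sketch; the database and table names are placeholders, and running the statements requires a Spark session with Hive support:

```python
def repair_and_analyze_sql(db_name: str, table_name: str) -> list:
    """Build the repair and statistics statements as plain SQL strings."""
    return [
        f"MSCK REPAIR TABLE {db_name}.{table_name}",
        f"ANALYZE TABLE {db_name}.{table_name} COMPUTE STATISTICS",
    ]

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # requires Spark built with Hive support

    spark = (SparkSession.builder
             .enableHiveSupport()
             .getOrCreate())
    # Placeholder names; substitute your own database and table.
    for stmt in repair_and_analyze_sql("my_db", "my_table"):
        spark.sql(stmt)
```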
Upvotes: 0
Reputation: 1
It sounds like your problem is in the coalesce call: when it merges the data into fewer files it can rearrange it, which can lead to inconsistencies.
Upvotes: 0