Reputation: 842
I have a Hive table that is partitioned on a column based on the year. Data is loaded into this table daily, and I don't have the option to run a daily msck repair. Since the partition is by year, do I need to run msck repair after each daily load if no new partition is added? I have tried the below.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.databricks.spark.avro._  // provides the .avro(...) writer used below

val data = Seq(Row("1", "2020-05-11 15:17:57.188", "2020"))
val schemaOrig = List(
  StructField("key", StringType, true),
  StructField("txn_ts", StringType, true),
  StructField("txn_dt", StringType, true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schemaOrig))
sourceDf.write.mode("overwrite").partitionBy("txn_dt").avro("/test_a")
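For reference: the .avro(...) shortcut comes from the spark-avro package's implicits. On Spark 2.4+ the same write can also be expressed with the built-in Avro source (a sketch assuming the same sourceDf and target path):

// Sketch: equivalent write via the built-in Avro data source (Spark 2.4+),
// useful if the .avro(...) implicit is not available in your environment.
sourceDf.write
  .mode("overwrite")
  .partitionBy("txn_dt")
  .format("avro")
  .save("/test_a")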
HIVE EXTERNAL TABLE
create external table test_a(
key string,
txn_ts string
)
partitioned by (txn_dt string)
stored as avro
location '/test_a';
msck repair table test_a;
select * from test_a;
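To confirm what the metastore actually registered after the repair, a quick check from the same Spark session (just a sketch; the equivalent SHOW PARTITIONS statement can also be run from the Hive CLI):

// Sketch: list the partitions the metastore knows about for test_a.
// Only partitions listed here are visible to Hive queries; files written into
// a directory for an unregistered partition are ignored until it is added.
spark.sql("show partitions test_a").show(false)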
Upvotes: 1
Views: 519
Reputation: 842
I noticed that if no new partition is added, msck repair is not needed: once a partition (here 2020) is registered in the metastore, Hive picks up new files written into that partition's directory at query time.
msck repair table test_a;
select * from test_a;
+----------------+--------------------------+------------------------+--+
| test_a.rowkey | test_a.txn_ts | test_a.order_entry_dt |
+----------------+--------------------------+------------------------+--+
| 1 | 2020-05-11 15:17:57.188 | 2020 |
+----------------+--------------------------+------------------------+--+
Now I added one more row with the same partition value (2020):
val data = Seq(Row("2", "2021-05-11 15:17:57.188", "2020"))
val schemaOrig = List(
  StructField("rowkey", StringType, true),
  StructField("txn_ts", StringType, true),
  StructField("order_entry_dt", StringType, true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
**Hive query returned 2 rows**
select * from test_a;
+----------------+--------------------------+------------------------+--+
| test_a.rowkey | test_a.txn_ts | test_a.order_entry_dt |
+----------------+--------------------------+------------------------+--+
| 1 | 2021-05-11 15:17:57.188 | 2020 |
| 2 | 2020-05-11 15:17:57.188 | 2020 |
+----------------+--------------------------+------------------------+--+
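This is expected: the append only wrote another Avro part file under the already-registered order_entry_dt=2020 directory, and Hive scans the whole directory at query time. A quick, hypothetical way to see that from the same session (assuming the files are on HDFS under /test_a):

// Sketch: list the part files under the existing 2020 partition directory.
// Both Avro files sit in a partition the metastore already knows about,
// which is why the new row shows up without another msck repair.
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/test_a/order_entry_dt=2020")).foreach(s => println(s.getPath))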
Now I tried adding a NEW PARTITION value (2021) to see if the select query would return it without msck repair:
val data = Seq(Row("3", "2021-05-11 15:17:57.188", "2021"))
val schemaOrig = List(
  StructField("rowkey", StringType, true),
  StructField("txn_ts", StringType, true),
  StructField("order_entry_dt", StringType, true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
Without msck repair the query again returned only 2 rows instead of 3, because the new partition (2021) is not yet registered in the metastore.
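So a full msck repair is only needed when a brand-new partition value appears, which with yearly partitioning happens once a year. If even that is undesirable after a load, one lighter-weight option, shown here only as a sketch and assuming the partition column is order_entry_dt with the default directory layout, is to register just the new partition:

// Sketch: register only the newly written partition instead of running a
// full MSCK REPAIR. Partition value and location below are illustrative.
val newPartition = "2021"
spark.sql(
  s"""ALTER TABLE test_a ADD IF NOT EXISTS
     |PARTITION (order_entry_dt='$newPartition')
     |LOCATION '/test_a/order_entry_dt=$newPartition'""".stripMargin)

Once the partition is registered (by this statement or by msck repair), the select should return all 3 rows.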
Upvotes: 1