Reputation: 842
I have a Hive table that is partitioned on a column based on the year. Data is loaded into this table daily, and I don't have the option to run a daily msck repair. Since the partition is by year, do I need to run msck repair after each daily load if no new partition is added? I have tried the below.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import com.databricks.spark.avro._  // provides the .avro(...) writer used below

val data = Seq(Row("1", "2020-05-11 15:17:57.188", "2020"))
val schemaOrig = List(
  StructField("key", StringType, true),
  StructField("txn_ts", StringType, true),
  StructField("txn_dt", StringType, true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schemaOrig))
sourceDf.write.mode("overwrite").partitionBy("txn_dt").avro("/test_a")
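For reference: the .avro(...) shortcut comes from the spark-avro package's implicits. On Spark 2.4+ the same write can also be expressed with the built-in Avro source (a sketch assuming the same sourceDf and target path):

// Sketch: equivalent write via the built-in Avro data source (Spark 2.4+),
// useful if the .avro(...) implicit is not available in your environment.
sourceDf.write
  .mode("overwrite")
  .partitionBy("txn_dt")
  .format("avro")
  .save("/test_a")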
HIVE EXTERNAL TABLE
create external table test_a(
key string,
txn_ts string
)
partitioned by (txn_dt string)
stored as avro
location '/test_a';
msck repair table test_a;
select * from test_a;
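To confirm what the metastore actually registered after the repair, a quick check from the same Spark session (just a sketch; the equivalent SHOW PARTITIONS statement can also be run from the Hive CLI):

// Sketch: list the partitions the metastore knows about for test_a.
// Only partitions listed here are visible to Hive queries; files written into
// a directory for an unregistered partition are ignored until it is added.
spark.sql("show partitions test_a").show(false)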
Upvotes: 1
Views: 519
Reputation: 842
I noticed that if no new partition is added, msck repair is not needed: once a partition (here 2020) is registered in the metastore, Hive picks up new files written into that partition's directory at query time.
msck repair table test_a;
select * from test_a;
+----------------+--------------------------+------------------------+--+
| test_a.rowkey | test_a.txn_ts | test_a.order_entry_dt |
+----------------+--------------------------+------------------------+--+
| 1 | 2020-05-11 15:17:57.188 | 2020 |
+----------------+--------------------------+------------------------+--+
Now I added one more row with the same partition value (2020):
val data = Seq(Row("2", "2021-05-11 15:17:57.188", "2020"))
val schemaOrig = List(
  StructField("rowkey", StringType, true),
  StructField("txn_ts", StringType, true),
  StructField("order_entry_dt", StringType, true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
**Hive query returned 2 rows**
select * from test_a;
+----------------+--------------------------+------------------------+--+
| test_a.rowkey | test_a.txn_ts | test_a.order_entry_dt |
+----------------+--------------------------+------------------------+--+
| 1 | 2021-05-11 15:17:57.188 | 2020 |
| 2 | 2020-05-11 15:17:57.188 | 2020 |
+----------------+--------------------------+------------------------+--+
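This is expected: the append only wrote another Avro part file under the already-registered order_entry_dt=2020 directory, and Hive scans the whole directory at query time. A quick, hypothetical way to see that from the same session (assuming the files are on HDFS under /test_a):

// Sketch: list the part files under the existing 2020 partition directory.
// Both Avro files sit in a partition the metastore already knows about,
// which is why the new row shows up without another msck repair.
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/test_a/order_entry_dt=2020")).foreach(s => println(s.getPath))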
Now I tried adding a NEW PARTITION value (2021) to see if the select query would return it without msck repair:
val data = Seq(Row("3", "2021-05-11 15:17:57.188", "2021"))
val schemaOrig = List(
  StructField("rowkey", StringType, true),
  StructField("txn_ts", StringType, true),
  StructField("order_entry_dt", StringType, true))
val sourceDf = spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schemaOrig))
sourceDf.write.mode("append").partitionBy("order_entry_dt").avro("/test_a")
Without msck repair the query again returned only 2 rows instead of 3, because the new partition (2021) is not yet registered in the metastore.
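So a full msck repair is only needed when a brand-new partition value appears, which with yearly partitioning happens once a year. If even that is undesirable after a load, one lighter-weight option, shown here only as a sketch and assuming the partition column is order_entry_dt with the default directory layout, is to register just the new partition:

// Sketch: register only the newly written partition instead of running a
// full MSCK REPAIR. Partition value and location below are illustrative.
val newPartition = "2021"
spark.sql(
  s"""ALTER TABLE test_a ADD IF NOT EXISTS
     |PARTITION (order_entry_dt='$newPartition')
     |LOCATION '/test_a/order_entry_dt=$newPartition'""".stripMargin)

Once the partition is registered (by this statement or by msck repair), the select should return all 3 rows.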
Upvotes: 1