Reputation: 49
I am working on implementing the incremental process on hive table A; Table A - is already created in hive with partitioned on YearMonth ( YYYYMM column ) with full volume.
On-going basis, we are planning to import the updates/inserts from source and capture in hive Delta Table;
as shown in below picture, Delta table indicates that new updates are pertaining to partitions ( 201804 / 201611 / 201705 ).
For incremental process , I am planning to
INSERT INTO delta2 select YYYYMM from Table where YYYYMM in ( select distinct YYYYMM from Delta );
Merge these 3 partitions from Delta table with corresponding partitions from original table. ( I can follow Horton works 4 step strategy to apply the updates )
Merge Delta2 + Delta : = new 3 partitions.
Drop 3 partitions from original table
Alter Table Drop partitions 201804 / 201611 / 201705
Add newly merged partitions back to Original table ( having new updates )
I need to automate this scripts - Can you please suggest how to put above logic in hive QL or spark - Speacifically Identify partitions and drop them from original table.
Upvotes: 1
Views: 2259
Reputation: 4674
you can build a solution using pyspark. I am explaining this approach with some basic example. you can re-modify it as per your business requirements.
Suppose you have a partitioned table in hive below configuration.
CREATE TABLE IF NOT EXISTS udb.emp_partition_Load_tbl (
emp_id smallint
,emp_name VARCHAR(30)
,emp_city VARCHAR(10)
,emp_dept VARCHAR(30)
,emp_salary BIGINT
)
PARTITIONED BY (Year String, Month String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS ORC;
and you got some csv file with some input records which you want to load into your partitioned table
1|vikrant singh rana|Gurgaon|Information Technology|20000
dataframe = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "false") \
.option("inferschema", "true") \
.schema(userschema) \
.option("delimiter", "|").load("file:///filelocation/userinput")
newdf = dataframe.withColumn('year', lit('2018')).withColumn('month',lit('01'))
+------+------------------+--------+----------------------+----------+----+-----+
|emp-id|emp-name |emp-city|emp-department |emp-salary|year|month|
+------+------------------+--------+----------------------+----------+----+-----+
|1 |vikrant singh rana|Gurgaon |Information Technology|20000 |2018|01 |
+------+------------------+--------+----------------------+----------+----+-----+
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
spark.sql("set spark.hadoop.hive.exec.dynamic.partition=true");
spark.sql("set spark.hadoop.hive.exec.dynamic.partition.mode=nonstrict");
newdf.write.format('orc').mode("overwrite").insertInto('udb.emp_partition_Load_tbl')
lets say you got another set of data and want to insert into some other partitions
+------+--------+--------+--------------+----------+----+-----+
|emp-id|emp-name|emp-city|emp-department|emp-salary|year|month|
+------+--------+--------+--------------+----------+----+-----+
| 2| ABC| Gurgaon|HUMAN RESOURCE| 10000|2018| 02|
+------+--------+--------+--------------+----------+----+-----+
newdf.write.format('orc').mode("overwrite").insertInto('udb.emp_partition_Load_tbl')
> show partitions udb.emp_partition_Load_tbl;
+---------------------+--+
| partition |
+---------------------+--+
| year=2018/month=01 |
| year=2018/month=02 |
+---------------------+--+
assuming you have got another set of records pertaining to existing partition.
3|XYZ|Gurgaon|HUMAN RESOURCE|80000
newdf = dataframe.withColumn('year', lit('2018')).withColumn('month',lit('02'))
+------+--------+--------+--------------+----------+----+-----+
|emp-id|emp-name|emp-city|emp-department|emp-salary|year|month|
+------+--------+--------+--------------+----------+----+-----+
| 3| XYZ| Gurgaon|HUMAN RESOURCE| 80000|2018| 02|
+------+--------+--------+--------------+----------+----+-----+
newdf.write.format('orc').mode("overwrite").insertInto('udb.emp_partition_Load_tbl')
select * from udb.emp_partition_Load_tbl where year ='2018' and month ='02';
+---------+-----------+-----------+-----------------+-------------+-------+--------+--+
| emp_id | emp_name | emp_city | emp_dept | emp_salary | year | month |
+---------+-----------+-----------+-----------------+-------------+-------+--------+--+
| 3 | XYZ | Gurgaon | HUMAN RESOURCE | 80000 | 2018 | 02 |
| 2 | ABC | Gurgaon | HUMAN RESOURCE | 10000 | 2018 | 02 |
+---------+-----------+-----------+-----------------+-------------+-------+--------+--+
you can see below that other partiion data was untouched.
> select * from udb.emp_partition_Load_tbl where year ='2018' and month ='01';
+---------+---------------------+-----------+-------------------------+-------------+-------+--------+--+
| emp_id | emp_name | emp_city | emp_dept | emp_salary | year | month |
+---------+---------------------+-----------+-------------------------+-------------+-------+--------+--+
| 1 | vikrant singh rana | Gurgaon | Information Technology | 20000 | 2018 | 01 |
+---------+---------------------+-----------+-------------------------+-------------+-------+--------+--+
Upvotes: 2