Reputation: 284
Requirement: I need to fetch thousands of tables from an RDBMS (SQL Server) and push them into Hive for running analytics and generating reports in near real time. I am using NiFi to read data from the RDBMS tables and push it into Hive as external tables.
Issue: Let's say I have an Employee table in the RDBMS with 10 records. I run a processor in NiFi which constantly reads my Employee table to fetch newly updated records based on a "lastModifiedTimeStamp" column. Now let's say both the Hive and RDBMS tables are in sync, each with 10 records, and I update 1 row in the RDBMS. NiFi reads this newly updated record, and the "PutORC"/"PutHDFS" processor places it in a new external file in the specified HDFS directory.
Now when I run "select * from employee" in Hive, it returns 11 rows instead of 10. I want it to update the existing record if one is present.
Question: With every update in the RDBMS, my Hive table will keep growing with duplicate records. How do I keep all updated records in sync with Hive in near real time? How do people working with Hive/Hadoop manage these duplicates while querying? Is this a correct approach?
I know there is a MERGE command which can merge 2 tables to keep the data in sync, but that would be a very heavy operation, as we would need to fetch the whole table's data into a staging table every 5-10 minutes.
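For reference, the MERGE I'm describing would look roughly like this. This is only a sketch: it assumes a transactional (ACID, ORC-backed) target table and hypothetical column names (id, name, lastModifiedTimeStamp) that stand in for the real schema:

```sql
-- Upsert staged changes into the target table.
-- employee must be a transactional table; employee_staging holds the
-- rows pulled from SQL Server since the last run.
MERGE INTO employee AS t
USING employee_staging AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET
  name = s.name,
  lastModifiedTimeStamp = s.lastModifiedTimeStamp
WHEN NOT MATCHED THEN INSERT VALUES
  (s.id, s.name, s.lastModifiedTimeStamp);
```

My concern is the cost of running something like this every 5-10 minutes across thousands of tables.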
Maybe PutHiveQL can help here by updating the records in Hive, but I am not sure how. Do I need to fetch from a CDC SQL Server table? Any approach?
Note: I know we should not use Hive as an RDBMS, but since Hive now supports update/delete operations, I wanted to know the correct approach and how people manage the duplicate data that gets created each time you pull from the RDBMS.
TIA
Upvotes: 0
Views: 328
Reputation: 5125
You actually want to create two tables, and stick a view on top of both.
You use the view to query the data. You have a base table with data from the past, and a delta table with new data that hasn't been merged into the base table yet.
Then you can write a view query that only shows the 10 records of the most up-to-date data.
Periodically you update the base table with the delta records, so that the delta table doesn't grow large, can fit in memory, and can be joined to the base table with a map-side join.
This helps to break the problem down and I think will work for you.
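The base/delta/view pattern above could be sketched in HiveQL roughly like this. Table and column names (employee_base, employee_delta, id, name, lastModifiedTimeStamp) are assumptions standing in for your real schema, not something from the question:

```sql
-- Sketch: employee_base holds the merged history, employee_delta holds
-- the new files NiFi lands; both share the same schema.
-- For each id, keep only the row with the latest timestamp.
CREATE VIEW employee AS
SELECT id, name, lastModifiedTimeStamp
FROM (
  SELECT id, name, lastModifiedTimeStamp,
         ROW_NUMBER() OVER (PARTITION BY id
                            ORDER BY lastModifiedTimeStamp DESC) AS rn
  FROM (
    SELECT id, name, lastModifiedTimeStamp FROM employee_base
    UNION ALL
    SELECT id, name, lastModifiedTimeStamp FROM employee_delta
  ) unioned
) ranked
WHERE rn = 1;
```

The periodic compaction step would then be something like `INSERT OVERWRITE TABLE employee_base SELECT * FROM employee;` followed by clearing the delta table, so the dedup work stays cheap between compactions.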
Upvotes: 1