Reputation: 1328
I've been reading a lot on hadoop lately and I can say that I understand the general concept of it, but there is still (at least)one piece of the puzzle that I can't get my head around. What is the best way to store relationnal data in hdfs.
First of all, I know that hadoop does not exist to replace my conventional good old sql database that serve my application. The problem that I'm facing here is that I would like to use hadoop to aggregate data from multiple systems into hdfs. I can then cross-reference data from multiple system together and then produce new set of data used by my reporting tools etc.
Alright, so, should I import tables data using one table to one file or should I import the results of queries that join tables.
For example:
SQL tables:
Person: PersonID Name Birthday Sex
Company:
CompanyID
Name
Address
PersonCompany
PersonID
CompanyID
Should I import all 3 table as is or should I import the result of the query returns why person works for which company.
Please share your thought with me!
Upvotes: 2
Views: 1746
Reputation: 2327
Typically to build a data warehouse in hadoop, you have to ingest all the tables. In your example you need to have all 3 tables in HDFS and then do the ETL/aggregation for example Joiners_weekly can have a etl which have
select * from PersonCompany pc join Person p on pc.personid=p.personid join Company c on pc.companyid=c.companyid.
This can be a report which can be generated from hadoop. Hope this helps.
Upvotes: 1
Reputation: 284
I think you should consider using HBase on top of HDFS to get some Database functionality:
Upvotes: 1