Best way to store relational data in hdfs

Question

I've been reading a lot on hadoop lately and I can say that I understand the general concept of it, but there is still (at least)one piece of the puzzle that I can't get my head around. What is the best way to store relationnal data in hdfs.

First of all, I know that hadoop does not exist to replace my conventional good old sql database that serve my application. The problem that I'm facing here is that I would like to use hadoop to aggregate data from multiple systems into hdfs. I can then cross-reference data from multiple system together and then produce new set of data used by my reporting tools etc.

Alright, so, should I import tables data using one table to one file or should I import the results of queries that join tables.

For example:

SQL tables:

Person: PersonID Name Birthday Sex

Company:

CompanyID
Name
Address

PersonCompany

PersonID
CompanyID

Should I import all 3 table as is or should I import the result of the query returns why person works for which company.

Please share your thought with me!

Gaurav Khare · Accepted Answer

Typically to build a data warehouse in hadoop, you have to ingest all the tables. In your example you need to have all 3 tables in HDFS and then do the ETL/aggregation for example Joiners_weekly can have a etl which have

select * from PersonCompany pc join Person p on pc.personid=p.personid join Company c on pc.companyid=c.companyid.

This can be a report which can be generated from hadoop. Hope this helps.

Best way to store relational data in hdfs

Answers (2)

Related Questions