Breach
Breach

Reputation: 1328

Best way to store relational data in hdfs

I've been reading a lot on hadoop lately and I can say that I understand the general concept of it, but there is still (at least)one piece of the puzzle that I can't get my head around. What is the best way to store relationnal data in hdfs.

First of all, I know that hadoop does not exist to replace my conventional good old sql database that serve my application. The problem that I'm facing here is that I would like to use hadoop to aggregate data from multiple systems into hdfs. I can then cross-reference data from multiple system together and then produce new set of data used by my reporting tools etc.

Alright, so, should I import tables data using one table to one file or should I import the results of queries that join tables.

For example:

SQL tables:

Person: PersonID Name Birthday Sex

Company:

CompanyID
Name
Address

PersonCompany

PersonID
CompanyID

Should I import all 3 table as is or should I import the result of the query returns why person works for which company.

Please share your thought with me!

Upvotes: 2

Views: 1746

Answers (2)

Gaurav Khare
Gaurav Khare

Reputation: 2327

Typically to build a data warehouse in hadoop, you have to ingest all the tables. In your example you need to have all 3 tables in HDFS and then do the ETL/aggregation for example Joiners_weekly can have a etl which have

select * from PersonCompany pc join Person p on pc.personid=p.personid join Company c on pc.companyid=c.companyid.

This can be a report which can be generated from hadoop. Hope this helps.

Upvotes: 1

Nico Weisenauer
Nico Weisenauer

Reputation: 284

I think you should consider using HBase on top of HDFS to get some Database functionality:

http://hbase.apache.org/

Upvotes: 1

Related Questions