How to create HBase columns / table for related but separated entities

Question

I saw a video tutorial on HBase, where data was stored in a table like this:

EmployeeName | Height | ProjectInfo
-------------------------------------------------------
Jdoe         | 5'7"   | ProjA-TeamLead, ProjB-Contributor

What happens when a business requirement comes up that the name of ProjA has to be changed to ProjX? Wouldn't there be a separate table where project information is stored?

Ian Varley · Accepted Answer

In a relational database, yes: you'd have a project table, and employees would refer to it via a foreign key (here through an employee_project_role join table) that stores only the immutable project id rather than the name. Then, when you want to query it, you'd do a JOIN like:

SELECT
  employee.name,
  employee.height,
  project.name,
  employee_project_role.role_name
FROM
  employee
  INNER JOIN employee_project_role
    ON employee_project_role.employee_id = employee.employee_id
  INNER JOIN project
    ON employee_project_role.project_id = project.project_id;

This isn't how things are done in HBase (and other NoSQL databases). These databases are geared towards extremely large data sets distributed over many machines, which makes it a lot harder to execute complex joins like this transparently in ways that still perform well. Thus, HBase doesn't even have built-in joins.

Instead, the general approach with systems like this is that you denormalize your data and store things in a single table. So in this case, there might be one row per employee, with all of the employee's project role info denormalized into that row, probably in separate columns. (The contents of a row in HBase are actually a key/value map, so you can easily represent repeating things like all of their different roles.)
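To make the "row as a key/value map" idea concrete, here is a minimal sketch in plain Python. It only models the layout with dictionaries; it is not HBase client code, and the column names (`info:height`, `projects:<name>`) and the second employee are assumptions invented for illustration.

```python
# Plain-Python model of one denormalized table (not HBase client code).
# Layout: row key -> {"family:qualifier" -> value}. All names are hypothetical.
employee_table = {
    "jdoe": {
        "info:height": "5'7\"",
        "projects:ProjA": "TeamLead",
        "projects:ProjB": "Contributor",
    },
    "asmith": {
        "info:height": "6'0\"",
        "projects:ProjA": "Contributor",
    },
}


def project_roles(table, row_key):
    """Return {project name: role} for one employee by reading a single row."""
    prefix = "projects:"
    row = table[row_key]
    return {
        column[len(prefix):]: role
        for column, role in row.items()
        if column.startswith(prefix)
    }


# One row lookup answers "which projects and roles does jdoe have?" with no join.
roles = project_roles(employee_table, "jdoe")
print(roles)
```

Because each project is its own column, an employee can have any number of roles without a schema change; the trade-off is that the project name is now repeated in every row that mentions it.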

You're absolutely right, though: if you change the name of the project, you'd need to change the data that's stored for every employee on that project. In this respect, the relational model is "cleaner". But if you're dealing with petabytes of data or trillions of rows, the "clean" abstraction of a relational database becomes a lot messier, because you end up having to shard it all manually. The point of systems like HBase is to pay these costs up front in the design process, and not just assume the relational database will magically solve problems like this for you at scale (because it won't).
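The cost of a rename under denormalization can be sketched as follows. This is again a plain-Python model rather than HBase client code, and the table contents, the `projects:` column prefix, and the `rename_project` helper are all hypothetical. The point is only that the rename has to visit every row instead of updating one record in a project table.

```python
def rename_project(table, old_name, new_name):
    """Rewrite every row that mentions old_name; return how many rows changed."""
    old_column = "projects:" + old_name
    new_column = "projects:" + new_name
    rows_changed = 0
    for row in table.values():  # full scan: every employee row is checked
        if old_column in row:
            # Move the role value from the old column to the new one.
            row[new_column] = row.pop(old_column)
            rows_changed += 1
    return rows_changed


# Hypothetical denormalized data: row key -> {"family:qualifier" -> value}.
employee_table = {
    "jdoe": {"projects:ProjA": "TeamLead", "projects:ProjB": "Contributor"},
    "asmith": {"projects:ProjA": "Contributor"},
    "blee": {"projects:ProjB": "TeamLead"},
}

changed = rename_project(employee_table, "ProjA", "ProjX")
print(changed, employee_table["jdoe"])
```

In a real cluster the same pattern would be a scan plus a write per matching row, so the work grows with the number of employees rather than the number of projects.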

That said: if you don't expect to have at least terabytes of data (that's a million MB, remember), just do it in a relational database. It'll be much easier.
