Reputation: 2571
I was asked in an interview to design a schema. But I have a doubt: while designing a schema in Hive, do we follow the normalization process, or is it not required? Please help me understand.
Upvotes: 0
Views: 2638
Reputation: 371
With Hive, a good practice is to denormalize your data.
A join requires a reducer, which means the data has to be sorted during the shuffle stage. If you need to filter on a field from the joined table, you need one map, one shuffle, one reduce (for the join), and another map for the filter (unless one of your tables is small enough to fit in the mapper memory, in which case Hive can do a map-side join - around 10 MB by default).
Map (read) -> Shuffle (sort) -> Reduce (join) -> Map (filter) -> Reduce (write)
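As a sketch (the table and column names are made up for illustration), a query against a normalized schema could look like this; the join forces the full pipeline above before the filter can run, unless Hive converts it to a map-side join:

SELECT o.order_id, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.country = 'FR';

-- let Hive convert the join to a map-side join when the small table fits in memory
SET hive.auto.convert.join = true;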
Filtering on a field that is already present in the denormalized table is done in the map stage.
Map (read + filter) -> Shuffle (sort) -> Reduce (write)
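With a denormalized table (again a hypothetical example), the same filter needs no join at all, since the mappers apply it while reading:

CREATE TABLE orders_denorm (
  order_id BIGINT,
  amount DOUBLE,
  customer_id BIGINT,
  customer_country STRING
);

-- the filter on customer_country is applied directly in the map phase
SELECT order_id, amount
FROM orders_denorm
WHERE customer_country = 'FR';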
You can get even better performance using advanced storage formats (ORC, Parquet, ...), which store some metadata alongside the data, allowing Hive to skip blocks during the filter phase (predicate pushdown at the storage layer).
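For example (a sketch, reusing the hypothetical table above), you could store the denormalized data as ORC and make sure predicate pushdown is enabled:

CREATE TABLE orders_denorm_orc
STORED AS ORC
AS SELECT * FROM orders_denorm;

SET hive.optimize.ppd = true;           -- enable predicate pushdown
SET hive.optimize.index.filter = true;  -- use ORC min/max indexes to skip stripes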
That answer is only valid if you use the MapReduce Hive engine. Using another engine with Hive (Tez, Spark, ...) or another SQL-on-Hadoop tool (Impala, Drill, ...), you may observe different behavior.
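If your Hive version supports it, the engine can be switched per session, and the execution plans (and so the trade-offs above) may change:

SET hive.execution.engine = mr;     -- classic MapReduce (the behavior described above)
SET hive.execution.engine = tez;    -- Tez
SET hive.execution.engine = spark;  -- Hive on Spark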
You can find more information on joins in the Hive manual page.
Upvotes: 2