Reputation: 147
I am trying to run the TPC-DS benchmark with Spark SQL.
The documentation describes a star schema and a number of tables.
From my understanding of Hadoop, it is better to have denormalized data, and then you can use a format like Parquet, which compresses well (and use partitions for parallelism).
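To make the denormalization idea concrete, here is a minimal sketch in plain Python (not Spark; all table and column names are made up for illustration) of flattening a star schema, where each fact row is joined to its dimension rows so the result is one wide table. In Spark SQL the equivalent would be a JOIN performed once at load time, with the flat result written out as Parquet partitioned by a column such as year:

```python
# Toy star schema: one fact table plus two dimension tables.
# (Hypothetical data, for illustration only.)
fact_sales = [
    {"date_id": 1, "item_id": 10, "qty": 3},
    {"date_id": 2, "item_id": 11, "qty": 5},
]
dim_date = {1: {"year": 2016, "month": 1}, 2: {"year": 2016, "month": 2}}
dim_item = {10: {"item_name": "widget"}, 11: {"item_name": "gadget"}}

def denormalize(facts, dates, items):
    """Join each fact row to its dimension rows, producing one wide row."""
    flat = []
    for f in facts:
        row = dict(f)                     # start from the fact columns
        row.update(dates[f["date_id"]])   # pull in date attributes
        row.update(items[f["item_id"]])   # pull in item attributes
        flat.append(row)
    return flat

flat_sales = denormalize(fact_sales, dim_date, dim_item)
# Each row now carries all attributes, so queries need no joins;
# on disk you would partition such a table by e.g. "year"/"month".
```

The trade-off is the usual one: the flat table is larger and duplicates dimension values, but columnar storage like Parquet compresses the repeated values well, and queries avoid join cost.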
I also found this document from SAS -> https://support.sas.com/resources/papers/data-modeling-hadoop.pdf
which makes the same point. I am no data warehouse expert, so I would appreciate help understanding how to model data for a data warehouse on Hadoop.
Upvotes: 1
Views: 265