Tools implementing management and usage of indexes on WORM data storage like Apache Parquet files

Question

Indexed columns allow quick access to data in a database, based on matching/filtering criteria. This is a common feature in relational databases, but it is harder to find in Big Data tooling. My question is about the technologies that currently allow indexed access to data stored in WORM (write once, read many) files.

Apache Parquet is an open-source, self-describing, columnar file format for Big Data that is compressed, size-efficient and write-once (WORM). It includes some metadata statistics, such as min/max values or dictionary entries, per row group (for example 10000 rows wide). Query engines use this metadata to skip irrelevant row groups, but the pruning is effective only on sorted numerical values or on dictionary-encoded values. Bloom filters also help query planning select the relevant row groups for equality predicates. However, loading a whole row group is not very efficient when joining fact and dimension tables. Worse, random data such as UUIDs cannot benefit from statistics-based pruning.

Beginning with version 2.5, Apache Parquet has the ability to include index columns. However, I have not found information about tools that manage these index columns and use them for query planning optimization. The situation is similar with Apache ORC, another WORM file format: I have found no clear technical information about its ability to include index data, and no tool that would manage or use it.
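To make the limitation concrete, here is a minimal pure-Python sketch of row-group pruning with min/max statistics and a per-row-group Bloom filter, as mentioned above. It does not read real Parquet files: `RowGroup`, `bloom_positions`, `groups_to_read` and the filter sizes are hypothetical names and toy values chosen for illustration, not part of any Parquet library.

```python
import hashlib
import uuid

BLOOM_BITS = 1 << 14   # toy Bloom filter size per row group (assumption)
BLOOM_HASHES = 7       # bit positions derived per value (assumption)


def bloom_positions(value):
    """Derive BLOOM_HASHES bit positions for a value from one SHA-256 digest."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % BLOOM_BITS
            for i in range(BLOOM_HASHES)]


class RowGroup:
    """Hypothetical stand-in for a row group plus its footer metadata."""

    def __init__(self, values):
        self.values = list(values)
        self.min_value = min(self.values)
        self.max_value = max(self.values)
        self.bloom = bytearray(BLOOM_BITS // 8)
        for value in self.values:
            for pos in bloom_positions(value):
                self.bloom[pos >> 3] |= 1 << (pos & 7)

    def may_contain(self, value, use_bloom=False):
        """Return False only when the metadata proves the value is absent."""
        if not (self.min_value <= value <= self.max_value):
            return False
        if use_bloom:
            return all(self.bloom[pos >> 3] & (1 << (pos & 7))
                       for pos in bloom_positions(value))
        return True


def make_row_groups(values, rows_per_group):
    """Split a column into consecutive row groups of fixed row count."""
    return [RowGroup(values[i:i + rows_per_group])
            for i in range(0, len(values), rows_per_group)]


def groups_to_read(row_groups, value, use_bloom=False):
    """Indexes of the row groups an equality lookup still has to load."""
    return [i for i, group in enumerate(row_groups)
            if group.may_contain(value, use_bloom)]


# Sorted numeric column: min/max ranges do not overlap, one group survives.
sorted_groups = make_row_groups(list(range(10_000)), 1_000)
print(groups_to_read(sorted_groups, 4_213))  # [4]

# Random UUID column: every group spans almost the whole key space.
uuid_values = [str(uuid.uuid4()) for _ in range(10_000)]
uuid_groups = make_row_groups(uuid_values, 1_000)
target = uuid_values[4_213]
print(len(groups_to_read(uuid_groups, target)))             # min/max only
print(groups_to_read(uuid_groups, target, use_bloom=True))  # with Bloom filter
```

On the UUID column the min/max check usually keeps most or all row groups, while the Bloom filter narrows the lookup to the group that holds the value plus occasional false positives. Even then the whole row group still has to be loaded and scanned, which is the cost an actual index would avoid.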

Here are the few tools I have investigated so far.

Apache Drill is an open-source implementation of Google Dremel, a storage-less query engine designed as a scalable front-end service. It can query data sources such as Apache Parquet files, but it can use indexes only through MapR (now HPE Ezmeral Data Fabric). Drill does not manage indexes by itself.

Apache Spark is a framework that processes batches of data on a distributed architecture. It does not maintain indexes on Apache Parquet files either.

I also investigated the Apache Pinot distributed database. It features nine types of indexes, plus Bloom filters, on WORM data. So far it is the only Big Data tool I have found that clearly features index management. I have certainly missed many others, which is the purpose of this question.
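To illustrate the kind of index management I am looking for, here is a minimal sketch of an inverted index kept beside immutable data. It is not how Pinot or any other tool implements its indexes; `build_inverted_index`, `lookup` and `groups_for_keys` are hypothetical helpers, and each row group is modeled as a plain list of key values. Because the data is write-once, such an index could be built once and never updated.

```python
from collections import defaultdict


def build_inverted_index(row_groups):
    """Map each distinct value to the sorted row-group numbers containing it."""
    index = defaultdict(set)
    for group_id, values in enumerate(row_groups):
        for value in values:
            index[value].add(group_id)
    return {value: sorted(ids) for value, ids in index.items()}


def lookup(index, value):
    """Row groups to read for an equality predicate on a single value."""
    return index.get(value, [])


def groups_for_keys(index, keys):
    """Row groups to read for a set of keys, e.g. keys selected on a dimension table."""
    wanted = set()
    for key in keys:
        wanted.update(index.get(key, []))
    return sorted(wanted)


# Three toy row groups holding unsorted, random-looking keys.
row_groups = [
    ["k7", "a2", "z9"],
    ["b4", "k7", "c1"],
    ["m3", "q8", "a2"],
]
index = build_inverted_index(row_groups)

print(lookup(index, "k7"))                   # [0, 1]
print(lookup(index, "zz"))                   # []
print(groups_for_keys(index, ["a2", "q8"]))  # [0, 2]
```

Unlike min/max statistics, this lookup does not depend on the keys being sorted, which is why it would suit UUID-like columns and fact/dimension joins.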

What are the available options to benefit from indexes on WORM data storage like Apache Parquet?


Answers (0)
