Reputation: 453
Could anybody please explain what is meant by the indexing process in Hadoop? Is it something like the traditional indexing of data that we do in an RDBMS? Drawing the same analogy, do we index the data blocks in Hadoop and store the physical addresses of the blocks in some data structure, so that it takes additional space in the cluster?
I Googled around this topic but could not find anything satisfactory or detailed. Any pointers will help.
Thanks in advance.
Upvotes: 3
Views: 7935
Reputation: 419
We can identify two different levels of granularity for creating indices: an index based on file URI, or an index based on InputSplit. Let's take two different examples of a data set.
First example:
Two files in your data set fit in 25 blocks and have been identified as 7 different InputSplits. The target you are looking for (highlighted in grey) is available in file #1 (blocks #2, #8 and #13) and in file #2 (block #17).
With file-based indexing, you will end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full-scan query. With InputSplit-based indexing, you will end up with 4 InputSplits out of the 7 available. Performance should be definitely better than doing a full-scan query.
Let’s take a second example:
This time the same data set has been sorted by the column you want to index. The target you are looking for (highlighted in grey) is now available in file #1 (blocks #1, #2, #3 and #4).
With file-based indexing, you will end up with only 1 file from your data set. With InputSplit-based indexing, you will end up with 1 InputSplit out of the 7 available. For this specific study, I decided to use a custom InputSplit-based index. I believe such an approach strikes a good balance between the effort it takes to implement, the added value it can bring in terms of performance optimization, and its expected applicability regardless of the data distribution.
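The idea above can be sketched as a lookup table from indexed value to InputSplit IDs. This is a minimal illustration with made-up split numbers, assuming an in-memory dict; a real implementation would persist the index in HDFS and use it to pick which splits a job reads:

```python
# Minimal sketch of an InputSplit-based index (hypothetical data).
# Keys are indexed column values; values are the IDs of the
# InputSplits in which that value occurs.
split_index = {
    "target": {1, 3, 5, 6},   # e.g. the 4 of 7 splits from the first example
    "other":  {0, 2, 4},
}

all_splits = set(range(7))    # 7 InputSplits in the whole data set

def splits_to_scan(value):
    """Return only the InputSplits that can contain `value`."""
    return split_index.get(value, set())

# An indexed query scans 4 splits instead of all 7:
print(len(splits_to_scan("target")), "of", len(all_splits))  # 4 of 7
```

A full scan would read all 7 splits; the index restricts the query to the 4 that can actually contain the target.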
Upvotes: 4
Reputation: 51
Hadoop stores data in files and does not index them. To find something, we have to run a MapReduce job that goes through all the data. Hadoop is efficient where the data is too big for a database. With very large data sets, the cost of regenerating indexes is so high that you cannot easily index changing data.
However, we can use indexing in HDFS at two levels: file-based indexing and InputSplit-based indexing. Let's assume we have 2 files to store in HDFS for processing: the first is 500 MB and the second is around 250 MB. With a 128 MB split size, we will have 4 InputSplits for the 1st file and 2 InputSplits for the 2nd file (since ceil(500/128) = 4 and ceil(250/128) = 2). We can apply two types of indexing for this case:
1. With file-based indexing, you will end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full-scan query.
2. With InputSplit-based indexing, you will end up with only the subset of InputSplits that contain your target, so performance should be definitely better than doing a full-scan query.
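The split counts above follow from simple ceiling division; a quick sketch (assuming the default 128 MB split size, which matches the HDFS default block size):

```python
import math

SPLIT_SIZE_MB = 128  # default HDFS block / split size assumed here

def num_splits(file_size_mb):
    """Number of InputSplits a file of the given size produces."""
    return math.ceil(file_size_mb / SPLIT_SIZE_MB)

print(num_splits(500))  # 4 splits for the 500 MB file
print(num_splits(250))  # 2 splits for the 250 MB file
```

Note that in practice the last split of a file may be smaller than 128 MB, and FileInputFormat can merge a small tail into the previous split, but the ceiling estimate is the usual rule of thumb.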
Now, to implement an InputSplit-based index, there are a few steps to perform. For a code sample and other details, refer to this:
https://hadoopi.wordpress.com/2013/05/24/indexing-on-mapreduce-2/
Upvotes: 5