Reputation: 453
Could anybody please explain what is meant by the indexing process in Hadoop? Is it something like the traditional indexing of data that we do in an RDBMS? Drawing the same analogy, do we index the data blocks in Hadoop and store the physical addresses of the blocks in some data structure, so that it takes additional space in the cluster?
I Googled around this topic but could not find anything satisfactory or detailed. Any pointers will help.
Thanks in advance.
Upvotes: 3
Views: 7935
Reputation: 419
We can identify two different levels of granularity for creating indices: an index based on file URI, or an index based on InputSplit. Let's take two different examples of a data set.
First example:
Two files in your data set fit in 25 blocks and have been identified as 7 different InputSplits. The target you are looking for (highlighted in grey) is available in file #1 (blocks #2, #8 and #13) and in file #2 (block #17).
With file-based indexing, you will end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full-scan query. With InputSplit-based indexing, you will end up with 4 InputSplits out of the 7 available. Performance should be definitely better than doing a full-scan query.
Let’s take a second example:
This time the same data set has been sorted by the column you want to index. The target you are looking for (highlighted in grey) is now available in file #1 (blocks #1, #2, #3 and #4).
With file-based indexing, you will end up with only 1 file from your data set. With InputSplit-based indexing, you will end up with 1 InputSplit out of the 7 available. For this specific study, I decided to use a custom InputSplit-based index. I believe such an approach strikes a good balance between the effort it takes to implement, the added value it can bring in terms of performance optimization, and its expected applicability regardless of the data distribution.
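The idea above can be sketched as a lookup table from indexed value to InputSplit IDs. This is a minimal illustration with made-up split numbers, assuming an in-memory dict; a real implementation would persist the index in HDFS and use it to pick which splits a job reads:

```python
# Minimal sketch of an InputSplit-based index (hypothetical data).
# Keys are indexed column values; values are the IDs of the
# InputSplits in which that value occurs.
split_index = {
    "target": {1, 3, 5, 6},   # e.g. the 4 of 7 splits from the first example
    "other":  {0, 2, 4},
}

all_splits = set(range(7))    # 7 InputSplits in the whole data set

def splits_to_scan(value):
    """Return only the InputSplits that can contain `value`."""
    return split_index.get(value, set())

# An indexed query scans 4 splits instead of all 7:
print(len(splits_to_scan("target")), "of", len(all_splits))  # 4 of 7
```

A full scan would read all 7 splits; the index restricts the query to the 4 that can actually contain the target.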
Upvotes: 4
Reputation: 51
Hadoop stores data in files and does not index them. To find something, we have to run a MapReduce job that goes through all the data. Hadoop is efficient where the data is too big for a database. With very large data sets, the cost of regenerating indexes is so high that you cannot easily index changing data.
However, we can use indexing in HDFS at two levels: file-based indexing and InputSplit-based indexing. Let's assume we have 2 files to store in HDFS for processing: the first is 500 MB and the second is around 250 MB. With a 128 MB split size, we will have 4 InputSplits for the 1st file and 2 InputSplits for the 2nd file (since ceil(500/128) = 4 and ceil(250/128) = 2). We can apply two types of indexing for this case:
1. With file-based indexing, you will end up with 2 files (the full data set here), meaning that your indexed query will be equivalent to a full-scan query.
2. With InputSplit-based indexing, you will end up with only the subset of InputSplits that contain your target, so performance should be definitely better than doing a full-scan query.
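The split counts above follow from simple ceiling division; a quick sketch (assuming the default 128 MB split size, which matches the HDFS default block size):

```python
import math

SPLIT_SIZE_MB = 128  # default HDFS block / split size assumed here

def num_splits(file_size_mb):
    """Number of InputSplits a file of the given size produces."""
    return math.ceil(file_size_mb / SPLIT_SIZE_MB)

print(num_splits(500))  # 4 splits for the 500 MB file
print(num_splits(250))  # 2 splits for the 250 MB file
```

Note that in practice the last split of a file may be smaller than 128 MB, and FileInputFormat can merge a small tail into the previous split, but the ceiling estimate is the usual rule of thumb.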
Now, to implement an InputSplit-based index, there are a few steps to perform. For a code sample and other details, refer to this:
https://hadoopi.wordpress.com/2013/05/24/indexing-on-mapreduce-2/
Upvotes: 5