arsenal

Reputation: 24144

Using Hadoop to find files that contain a particular string

I have around 1000 files, and each file is about 1 GB in size. I need to find a string in all of these 1000 files, and also which files contain that particular string. I am working with the Hadoop File System, and all 1000 files are in HDFS.

All 1000 files are under the real folder, so if I run the command below, I get all of them. I need to find which files under real contain a particular string, hello.

bash-3.00$ hadoop fs -ls /technology/dps/real

And this is my data structure in HDFS:

row format delimited 
fields terminated by '\29'
collection items terminated by ','
map keys terminated by ':'
stored as textfile

How can I write MapReduce jobs to solve this particular problem, so that I can find which files contain a particular string? Any simple example would be a great help to me.

Update:-

I can solve the above problem with grep in Unix, but it is very slow and takes a long time to produce the actual output:

hadoop fs -ls /technology/dps/real | awk '{print $8}' | while read f; do hadoop fs -cat $f | grep cec7051a1380a47a4497a107fecb84c1 >/dev/null && echo $f; done

That is why I was looking for a MapReduce job to solve this kind of problem...

Upvotes: 3

Views: 13758

Answers (3)

Josh Rosen

Reputation: 13801

It sounds like you're looking for a grep-like program, which is easy to implement using Hadoop Streaming (the Hadoop Java API would work too):

First, write a mapper that outputs the name of the file being processed whenever a line contains your search string. I used Python, but any language would work:

#!/usr/bin/env python
import os
import sys

# The search string is passed in through the job environment (see -cmdenv below).
SEARCH_STRING = os.environ["SEARCH_STRING"]

for line in sys.stdin:
    if SEARCH_STRING in line.split():
        # Hadoop Streaming exposes the current input file's path
        # in the map_input_file environment variable.
        print(os.environ["map_input_file"])

This code reads the search string from the SEARCH_STRING environment variable. Here, I split the input line and check whether the search string matches any of the tokens; you could change this to a plain substring check (SEARCH_STRING in line) or use regular expressions to check for matches.

Next, run a Hadoop streaming job using this mapper and no reducers:

$ bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -input hdfs:///data \
    -mapper search.py \
    -file search.py \
    -output /search_results \
    -cmdenv SEARCH_STRING="Apache"

The output will be written in several part files; to obtain a list of matches, you can simply cat them (provided they aren't too big). Note that the mapper prints the file name once per matching line, so the same file can appear more than once; pipe the result through sort -u if you want each file listed only once:

$ bin/hadoop fs -cat /search_results/part-*
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/ivy.xml   
hdfs://localhost/data/README.txt
... 

Upvotes: 4

rtheunissen

Reputation: 7435

You can try something like this, though I'm not sure if it's an efficient way to go about it. Let me know if it works - I haven't tested it or anything.

You can use it like this: java SearchFiles /technology/dps/real hello, making sure you run it from the appropriate directory, of course.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class SearchFiles {

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: [search-dir] [search-string]");
            return;
        }
        File searchDir = new File(args[0]);
        String searchString = args[1];
        ArrayList<File> matches = checkFiles(searchDir.listFiles(), searchString, new ArrayList<File>());
        System.out.println("These files contain '" + searchString + "':");
        for (File file : matches) {
            System.out.println(file.getPath());
        }
    }

    private static ArrayList<File> checkFiles(File[] files, String search, ArrayList<File> acc) throws IOException {
        if (files == null) {
            return acc; // path was not a directory, or it could not be listed
        }
        for (File file : files) {
            if (file.isDirectory()) {
                checkFiles(file.listFiles(), search, acc);
            } else {
                if (fileContainsString(file, search)) {
                    acc.add(file);
                }
            }
        }
        return acc;
    }

    private static boolean fileContainsString(File file, String search) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains(search)) {
                in.close();
                return true;
            }
        }
        in.close();
        return false;
    }
}
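
One caveat: java.io.File only sees the local filesystem, so the class above won't find files that live in HDFS. As a minimal sketch of the same scan using the Hadoop FileSystem API instead (assuming the Hadoop client libraries are on the classpath; the class name HdfsSearch is illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSearch {

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: [search-dir] [search-string]");
            return;
        }
        // Picks up fs.default.name from the Hadoop configuration on the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        String search = args[1];

        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
            if (status.isDir()) {
                continue; // recurse here if you need to descend into subdirectories
            }
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.contains(search)) {
                        System.out.println(status.getPath());
                        break; // one hit is enough to report the file
                    }
                }
            } finally {
                in.close();
            }
        }
    }
}

Like the grep pipeline in the question, though, this still streams every file through a single client, so it won't be any faster; a MapReduce job is what actually parallelizes the scan.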

Upvotes: 0

Donald Miner

Reputation: 39893

To get the filename you are currently processing, do:

((FileSplit) context.getInputSplit()).getPath().getName() 

As you search your file record by record, emit the above path whenever you see hello (and perhaps the matching line, or anything else you need).

Set the number of reducers to 0; they aren't doing anything here.
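
A minimal sketch of such a mapper, using the org.apache.hadoop.mapreduce API (the class name GrepMapper and the search.string configuration key are illustrative; the driver is assumed to set the search string on the job configuration):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private String searchString;

    @Override
    protected void setup(Context context) {
        // "search.string" is an illustrative key set by the driver via conf.set(...)
        searchString = context.getConfiguration().get("search.string", "hello");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().contains(searchString)) {
            // getPath() is the file this split came from; getName() would
            // give just the file name instead of the full path.
            Path path = ((FileSplit) context.getInputSplit()).getPath();
            context.write(new Text(path.toString()), NullWritable.get());
        }
    }
}

With zero reducers, each mapper's output goes straight to the output directory, one path per matching record (deduplicate afterwards if you only want each file listed once).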


Does 'row format delimited' mean that lines are delimited by newlines? If so, TextInputFormat and LineRecordReader will work fine here.

Upvotes: 1
