Reputation: 24144
I have around 1000 files, and each file is about 1 GB in size. I need to find a string in all of these 1000 files, and also which files contain that particular string. I am working with the Hadoop File System, and all 1000 files are in HDFS.
All 1000 files are under the real folder, so if I list that folder as shown below, I get all 1000 files. I need to find which files under the real folder contain a particular string, hello.
bash-3.00$ hadoop fs -ls /technology/dps/real
And this is my data structure in HDFS:
row format delimited
fields terminated by '\29'
collection items terminated by ','
map keys terminated by ':'
stored as textfile
How can I write a MapReduce job for this particular problem, so that I can find which files contain a particular string? Any simple example would be of great help to me.
Update:
Using grep on Unix I can solve the above problem, but it is very, very slow and takes a lot of time to get the actual output:
hadoop fs -ls /technology/dps/real | awk '{print $8}' | while read f; do hadoop fs -cat $f | grep cec7051a1380a47a4497a107fecb84c1 >/dev/null && echo $f; done
That is the reason I was looking for a MapReduce job to solve this kind of problem...
Upvotes: 3
Views: 13758
Reputation: 13801
It sounds like you're looking for a grep-like program, which is easy to implement using Hadoop Streaming (the Hadoop Java API would work too):
First, write a mapper that outputs the name of the file being processed if the line being processed contains your search string. I used Python, but any language would work:
#!/usr/bin/env python
import os
import sys

# The search string is passed in through the environment (see -cmdenv below).
SEARCH_STRING = os.environ["SEARCH_STRING"]

for line in sys.stdin:
    if SEARCH_STRING in line.split():
        # Hadoop Streaming exposes the current input file via map_input_file.
        print os.environ["map_input_file"]
This code reads the search string from the SEARCH_STRING environment variable. Here, I split the input line and check whether the search string matches any of the splits; you could change this to perform a substring search or use regular expressions to check for matches.
Next, run a Hadoop streaming job using this mapper and no reducers:
$ bin/hadoop jar contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=0 \
      -input hdfs:///data \
      -mapper search.py \
      -file search.py \
      -output /search_results \
      -cmdenv SEARCH_STRING="Apache"
The output will be written in several parts; to obtain a list of matches, you can simply cat the files (provided they aren't too big):
$ bin/hadoop fs -cat /search_results/part-*
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/CHANGES.txt
hdfs://localhost/data/ivy.xml
hdfs://localhost/data/README.txt
...
Upvotes: 4
Reputation: 7435
You can try something like this, though I'm not sure if it's an efficient way to go about it. Let me know if it works - I haven't tested it or anything.
You can use it like this: java SearchFiles /technology/dps/real hello, making sure you run it from the appropriate directory, of course.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public class SearchFiles {

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("Usage: [search-dir] [search-string]");
            return;
        }
        File searchDir = new File(args[0]);
        String searchString = args[1];
        ArrayList<File> matches = checkFiles(searchDir.listFiles(), searchString, new ArrayList<File>());
        System.out.println("These files contain '" + searchString + "':");
        for (File file : matches) {
            System.out.println(file.getPath());
        }
    }

    // Recursively walk the directory tree, accumulating files that contain the search string.
    private static ArrayList<File> checkFiles(File[] files, String search, ArrayList<File> acc) throws IOException {
        for (File file : files) {
            if (file.isDirectory()) {
                checkFiles(file.listFiles(), search, acc);
            } else {
                if (fileContainsString(file, search)) {
                    acc.add(file);
                }
            }
        }
        return acc;
    }

    // Returns true as soon as any line of the file contains the search string.
    private static boolean fileContainsString(File file, String search) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.contains(search)) {
                in.close();
                return true;
            }
        }
        in.close();
        return false;
    }
}
Upvotes: 0
Reputation: 39893
To get the filename you are currently processing, do:
((FileSplit) context.getInputSplit()).getPath().getName()
When you are searching your file record by record, emit the above path (and maybe the line, or anything else you need) whenever you see hello.
Set the number of reducers to 0; they aren't doing anything here.
Does 'row format delimited' mean that lines are delimited by a newline? In that case, TextInputFormat and LineRecordReader work fine here.
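Putting those pieces together, here is a minimal sketch of such a map-only job, assuming the org.apache.hadoop.mapreduce API (Hadoop 2.x style); the class name GrepFiles, the search.string configuration key, and the argument order are just illustrative, not anything standard:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GrepFiles {

    public static class GrepMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        private String searchString;

        @Override
        protected void setup(Context context) {
            // The search string is passed in through the job configuration.
            searchString = context.getConfiguration().get("search.string", "hello");
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().contains(searchString)) {
                // Name of the file this split belongs to; use getPath().toString() for the full path.
                String fileName =
                        ((FileSplit) context.getInputSplit()).getPath().getName();
                context.write(new Text(fileName), NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("search.string", args[2]);

        Job job = Job.getInstance(conf, "grep files");
        job.setJarByClass(GrepFiles.class);
        job.setMapperClass(GrepMapper.class);
        job.setNumReduceTasks(0);   // map-only job, as noted above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You could then run it with something like hadoop jar grepfiles.jar GrepFiles /technology/dps/real /search_results hello and inspect the part files written to /search_results.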
Upvotes: 1