Reputation: 3483
I'm trying to code the following in Hadoop MapReduce. I have a log file that contains IP addresses, each followed by the URL opened by that IP. It looks like this:
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
Now I need to organize the results from this file so that the output lists each IP address with the URLs it opened, each followed by the number of times that URL was opened by that IP.
For example, if 192.168.72.224 opens www.yahoo.com 15 times in the whole log file, then the output must contain:
192.168.72.224 www.yahoo.com 15
This should be done for all the IPs in the file, and the final output should look like:
192.168.72.224 www.yahoo.com 15
www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
www.gmail.com 19
....
...
..
.
The code that I've tried is:
public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
I know this code is seriously flawed; please suggest an idea to move forward.
Thank you.
Upvotes: 0
Views: 1768
Reputation: 419
I have written the same logic in Java:
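import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UrlHitMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString());
        // Emit (IP, URL) only when both tokens are present; skip malformed lines.
        if (st.countTokens() >= 2) {
            context.write(new Text(st.nextToken()), new Text(st.nextToken()));
        }
    }
}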
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UrlHitReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Count how many times each URL appears for this IP.
        Map<String, Integer> urlCount = new HashMap<>();
        for (Text value : values) {
            urlCount.merge(value.toString(), 1, Integer::sum);
        }
        // Emit one record per distinct URL: key is the IP, value is "URL count".
        for (Map.Entry<String, Integer> entry : urlCount.entrySet()) {
            context.write(key, new Text(entry.getKey() + " " + entry.getValue()));
        }
    }
}
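import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class UrlHitCount extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new UrlHitCount(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJobName("url-hit-count");
        job.setJarByClass(UrlHitCount.class);
        job.setMapperClass(UrlHitMapper.class);
        job.setReducerClass(UrlHitReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("input/urls"));
        FileOutputFormat.setOutputPath(job, new Path("url_output" + System.currentTimeMillis()));
        // Block until the job finishes; return 0 on success, 1 on failure.
        return job.waitForCompletion(true) ? 0 : 1;
    }
}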
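To see what the mapper and reducer above do without a cluster, the same flow can be simulated in plain Python. This is only a sketch: the names `map_line`, `shuffle`, `reduce_ip`, and `run` are made up for illustration and are not Hadoop APIs, skipping lines with fewer than two fields is an assumption of the sketch, and the sample lines are taken from the log in the question.

```python
from collections import defaultdict


def map_line(line):
    """Mimic the mapper: emit (ip, url) for a line with at least two fields."""
    parts = line.split()
    if len(parts) >= 2:  # assumption: malformed lines are silently skipped
        yield parts[0], parts[1]


def shuffle(pairs):
    """Mimic the shuffle phase: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_ip(ip, urls):
    """Mimic the reducer: count each URL seen for a single IP."""
    counts = {}
    for url in urls:
        counts[url] = counts.get(url, 0) + 1
    return [(ip, url, n) for url, n in counts.items()]


def run(lines):
    """Chain map, shuffle, and reduce over an iterable of log lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    out = []
    for ip, urls in shuffle(pairs).items():
        out.extend(reduce_ip(ip, urls))
    return out


if __name__ == "__main__":
    sample = [
        "192.168.72.177 www.yahoo.com",
        "192.168.72.177 www.yahoo.com",
        "192.168.72.224 www.facebook.com",
    ]
    for ip, url, n in run(sample):
        print(ip, url, n)
```

Because the key is the IP alone, every URL for one IP lands in the same reduce call, which is why a per-call hash map is enough to produce the counts.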
Upvotes: 1
Reputation: 5018
I would propose this design:
Implementing this would require a custom Writable to hold the (IP, URL) pair.
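Assuming the pair in question is the (IP, URL) combination, which is also what the Spark version below uses as its key, the idea can be sketched in plain Python before writing any Hadoop code. In this sketch a `NamedTuple` called `IpUrl` stands in for the custom Writable; that name and the helper functions are hypothetical, and a real Hadoop key class would additionally have to implement `WritableComparable`.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class IpUrl(NamedTuple):
    """Hypothetical composite key; stands in for a custom Writable."""
    ip: str
    url: str


def map_phase(lines):
    """Emit (IpUrl, 1) for every line that has at least two fields."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # assumption: malformed lines are skipped
            yield IpUrl(parts[0], parts[1]), 1


def reduce_phase(pairs):
    """Sum the ones per composite key, as a summing reducer would."""
    totals = Counter()
    for key, one in pairs:
        totals[key] += one
    return totals


def group_by_ip(totals):
    """Regroup the per-pair totals under each IP for display."""
    by_ip = defaultdict(dict)
    for key, count in totals.items():
        by_ip[key.ip][key.url] = count
    return dict(by_ip)


if __name__ == "__main__":
    sample = [
        "192.168.72.224 www.m4maths.com",
        "192.168.72.177 www.yahoo.com",
        "192.168.72.177 www.yahoo.com",
    ]
    for ip, urls in group_by_ip(reduce_phase(map_phase(sample))).items():
        print(ip)
        for url, count in urls.items():
            print(" ", url, count)
```

Since the reduce step here is a plain sum, it is associative and commutative, so the same logic could also run as a combiner to cut down the data shuffled between map and reduce.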
Personally, I'd do this with Spark unless performance is a major concern. With PySpark it would be as simple as this:
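# Assumes `sc` is an existing SparkContext (for example, from the pyspark shell).
rdd = sc.textFile('/sparkdemo/log.txt')
# Count occurrences of each (ip, url) pair.
counts = (rdd.map(lambda line: line.split())
             .map(lambda parts: ((parts[0], parts[1]), 1))
             .reduceByKey(lambda x, y: x + y))
# Re-key by IP so that each IP collects its (url, count) pairs.
result = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().collect()
for ip, sites in result:
    print('IP: %s' % ip)
    for url, cnt in sites:
        print(' website: %s count: %d' % (url, cnt))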
The output for your example would be:
IP: 192.168.72.224
 website: www.facebook.com count: 2
 website: www.m4maths.com count: 2
 website: www.google.com count: 5
 website: www.gmail.com count: 4
 website: www.indiabix.com count: 8
 website: www.yahoo.com count: 3
IP: 192.168.72.177
 website: www.yahoo.com count: 14
 website: www.google.com count: 3
 website: www.facebook.com count: 3
 website: www.m4maths.com count: 3
 website: www.indiabix.com count: 1
IP: 192.168.198.92
 website: www.facebook.com count: 4
 website: www.m4maths.com count: 3
 website: www.yahoo.com count: 3
 website: www.askubuntu.com count: 2
 website: www.indiabix.com count: 1
 website: www.google.com count: 5
 website: www.gmail.com count: 1
Upvotes: 1