Aditya Vikas Devarapalli

Reputation: 3483

Hit Count extraction from log files using mapreduce

I'm trying to code the following in Hadoop MapReduce. I have a log file in which each line contains an IP address followed by the URL opened by that IP. It is as follows:

192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com

Now I need to organize the results from this file so that it lists the distinct IP addresses, each with its URLs and the number of times that particular URL was opened by that IP.

For example, if 192.168.72.224 opens www.yahoo.com 15 times across the whole log file, then the output must contain:

192.168.72.224 www.yahoo.com 15

This should be done for all the IPs in the file, and the final output should look like:

192.168.72.224 www.yahoo.com 15
               www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
               www.gmail.com 19
....
...
..
.

The code that I've tried is:

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

I know this code is seriously flawed; please suggest an approach to move forward.

Thank you.

Upvotes: 0

Views: 1768

Answers (2)

Vikas Singh

Reputation: 419

I have written the same logic in Java:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UrlHitMapper extends Mapper<Object, Text, Text, Text> {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        StringTokenizer st = new StringTokenizer(value.toString());

        // Emit (IP, URL); skip malformed lines that don't contain both tokens
        if (st.countTokens() >= 2)
            context.write(new Text(st.nextToken()), new Text(st.nextToken()));
    }
}

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UrlHitReducer extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // Count, per IP (the key), how many times each URL occurs
        HashMap<String, Integer> urlCount = new HashMap<>();

        for (Text value : values) {
            String url = value.toString();
            urlCount.put(url, urlCount.getOrDefault(url, 0) + 1);
        }

        // Emit one (IP, "url count") record per distinct URL
        for (Map.Entry<String, Integer> e : urlCount.entrySet())
            context.write(key, new Text(e.getKey() + "    " + e.getValue()));
    }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class UrlHitCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {

        ToolRunner.run(new Configuration(), new UrlHitCount(), args);
    }

    public int run(String[] arg0) throws Exception {

        Job job = Job.getInstance(getConf());
        job.setJobName("url-hit-count");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(UrlHitMapper.class);
        job.setReducerClass(UrlHitReducer.class);

        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("input/urls"));
        FileOutputFormat.setOutputPath(job, new Path("url_output" + System.currentTimeMillis()));

        job.setJarByClass(UrlHitCount.class);

        // Block until the job finishes so the exit code reflects success or failure
        return job.waitForCompletion(true) ? 0 : 1;
    }

}
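
If you'd rather avoid building a HashMap inside the reducer, an alternative sketch (my own suggestion, with illustrative class names, not part of the code above) is to emit a composite "IP + URL" key with a count of 1 from the mapper, word-count style, and let the shuffle group identical pairs; the reducer then just sums:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical composite-key variant: key = "IP<TAB>URL", value = 1.
public class IpUrlCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text compositeKey = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        String[] parts = value.toString().trim().split("\\s+");

        // Skip malformed lines that don't contain both an IP and a URL
        if (parts.length >= 2) {
            compositeKey.set(parts[0] + "\t" + parts[1]);
            context.write(compositeKey, ONE);
        }
    }
}

public class IpUrlCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // Sum the 1s for each distinct (IP, URL) pair
        int sum = 0;
        for (IntWritable v : values)
            sum += v.get();

        context.write(key, new IntWritable(sum));
    }
}

Because the reducer only sums, the same class can also be registered as a combiner (job.setCombinerClass(IpUrlCountReducer.class)) to cut shuffle traffic.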

Upvotes: 1

0x0FFF

Reputation: 5018

I would propose this design:

  1. Mapper: reads a line from the file and outputs the IP as the key and a (website, 1) pair as the value.
  2. Combiner and Reducer: gets an IP as the key and a sequence of (website, count) pairs, aggregates them by website (using a HashMap), and outputs the IP, website, and count.

Implementing this would require you to write a custom Writable to handle the (website, count) pair.
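
A minimal sketch of such a pair Writable, assuming the value carries (website, count); the class and method names here are illustrative, not a standard Hadoop class:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical (website, count) pair used as the map output value.
public class SiteCountWritable implements Writable {

    private final Text site = new Text();
    private final IntWritable count = new IntWritable();

    public SiteCountWritable() { }  // no-arg constructor required by Hadoop

    public void set(String s, int c) {
        site.set(s);
        count.set(c);
    }

    public String getSite() { return site.toString(); }
    public int getCount() { return count.get(); }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize both fields in a fixed order
        site.write(out);
        count.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize in the same order they were written
        site.readFields(in);
        count.readFields(in);
    }
}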

Personally, I'd do this with Spark unless you are really concerned about performance. With PySpark it would be as simple as this:

rdd = sc.textFile('/sparkdemo/log.txt')
# Count occurrences of each (IP, URL) pair
counts = rdd.map(lambda line: line.split()) \
            .map(lambda parts: ((parts[0], parts[1]), 1)) \
            .reduceByKey(lambda x, y: x + y)
# Regroup as IP -> [(URL, count), ...]
result = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().collect()
for ip, sites in result:
    print('IP: %s' % ip)
    for url, cnt in sites:
        print('    website: %s count: %d' % (url, cnt))

The output for your example would be:

IP: 192.168.72.224
    website: www.facebook.com count: 2
    website: www.m4maths.com count: 2
    website: www.google.com count: 5
    website: www.gmail.com count: 4
    website: www.indiabix.com count: 8
    website: www.yahoo.com count: 3
IP: 192.168.72.177
    website: www.yahoo.com count: 14
    website: www.google.com count: 3
    website: www.facebook.com count: 3
    website: www.m4maths.com count: 3
    website: www.indiabix.com count: 1
IP: 192.168.198.92
    website: www.facebook.com count: 4
    website: www.m4maths.com count: 3
    website: www.yahoo.com count: 3
    website: www.askubuntu.com count: 2
    website: www.indiabix.com count: 1
    website: www.google.com count: 5
    website: www.gmail.com count: 1

Upvotes: 1
