Aditya Vikas Devarapalli

Reputation: 3483

Hit Count extraction from log files using mapreduce

I'm trying to code the following in Hadoop MapReduce. I have a log file in which each line contains an IP address followed by the URL opened by that IP. It is as follows:

192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com

Now I need to organize the results from this file so that it lists the distinct IP addresses, each with its URLs and the number of times that particular URL was opened by that IP.

For example, if 192.168.72.224 opens www.yahoo.com 15 times across the whole log file, then the output must contain:

192.168.72.224 www.yahoo.com 15

This should be done for all the IPs in the file, and the final output should look like:

192.168.72.224 www.yahoo.com 15
               www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
               www.gmail.com 19
....
...
..
.

The code that I've tried is:

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

I know this code is seriously flawed; please suggest an approach to move forward.

Thank you.

Upvotes: 0

Views: 1768

Answers (2)

Vikas Singh

Reputation: 419

I have written the same logic in Java:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UrlHitMapper extends Mapper<Object, Text, Text, Text> {

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {

        StringTokenizer st = new StringTokenizer(value.toString());

        // Emit (IP, URL); skip malformed lines that don't contain both tokens
        if (st.countTokens() >= 2)
            context.write(new Text(st.nextToken()), new Text(st.nextToken()));
    }
}

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UrlHitReducer extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {

        // Count, per IP (the key), how many times each URL occurs
        HashMap<String, Integer> urlCount = new HashMap<>();

        for (Text value : values) {
            String url = value.toString();
            urlCount.put(url, urlCount.getOrDefault(url, 0) + 1);
        }

        // Emit one (IP, "url count") record per distinct URL
        for (Map.Entry<String, Integer> e : urlCount.entrySet())
            context.write(key, new Text(e.getKey() + "    " + e.getValue()));
    }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class UrlHitCount extends Configured implements Tool {

    public static void main(String[] args) throws Exception {

        ToolRunner.run(new Configuration(), new UrlHitCount(), args);
    }

    public int run(String[] arg0) throws Exception {

        Job job = Job.getInstance(getConf());
        job.setJobName("url-hit-count");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(UrlHitMapper.class);
        job.setReducerClass(UrlHitReducer.class);

        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("input/urls"));
        FileOutputFormat.setOutputPath(job, new Path("url_output" + System.currentTimeMillis()));

        job.setJarByClass(UrlHitCount.class);

        // Block until the job finishes so the exit code reflects success or failure
        return job.waitForCompletion(true) ? 0 : 1;
    }

}
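
If you'd rather avoid building a HashMap inside the reducer, an alternative sketch (my own suggestion, with illustrative class names, not part of the code above) is to emit a composite "IP + URL" key with a count of 1 from the mapper, word-count style, and let the shuffle group identical pairs; the reducer then just sums:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical composite-key variant: key = "IP<TAB>URL", value = 1.
public class IpUrlCountMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text compositeKey = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

        String[] parts = value.toString().trim().split("\\s+");

        // Skip malformed lines that don't contain both an IP and a URL
        if (parts.length >= 2) {
            compositeKey.set(parts[0] + "\t" + parts[1]);
            context.write(compositeKey, ONE);
        }
    }
}

public class IpUrlCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // Sum the 1s for each distinct (IP, URL) pair
        int sum = 0;
        for (IntWritable v : values)
            sum += v.get();

        context.write(key, new IntWritable(sum));
    }
}

Because the reducer only sums, the same class can also be registered as a combiner (job.setCombinerClass(IpUrlCountReducer.class)) to cut shuffle traffic.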

Upvotes: 1

0x0FFF

Reputation: 5018

I would propose this design:

  1. Mapper: reads a line from the file and outputs the IP as the key and a (website, 1) pair as the value.
  2. Combiner and Reducer: gets an IP as the key and a sequence of (website, count) pairs, aggregates them by website (using a HashMap), and outputs the IP, website, and count.

Implementing this would require you to write a custom Writable to handle the (website, count) pair.
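
A minimal sketch of such a pair Writable, assuming the value carries (website, count); the class and method names here are illustrative, not a standard Hadoop class:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical (website, count) pair used as the map output value.
public class SiteCountWritable implements Writable {

    private final Text site = new Text();
    private final IntWritable count = new IntWritable();

    public SiteCountWritable() { }  // no-arg constructor required by Hadoop

    public void set(String s, int c) {
        site.set(s);
        count.set(c);
    }

    public String getSite() { return site.toString(); }
    public int getCount() { return count.get(); }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize both fields in a fixed order
        site.write(out);
        count.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize in the same order they were written
        site.readFields(in);
        count.readFields(in);
    }
}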

Personally, I'd do this with Spark unless you are really concerned about performance. With PySpark it would be as simple as this:

rdd = sc.textFile('/sparkdemo/log.txt')
# Count occurrences of each (IP, URL) pair
counts = rdd.map(lambda line: line.split()) \
            .map(lambda parts: ((parts[0], parts[1]), 1)) \
            .reduceByKey(lambda x, y: x + y)
# Regroup as IP -> [(URL, count), ...]
result = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().collect()
for ip, sites in result:
    print('IP: %s' % ip)
    for url, cnt in sites:
        print('    website: %s count: %d' % (url, cnt))

The output for your example would be:

IP: 192.168.72.224
    website: www.facebook.com count: 2
    website: www.m4maths.com count: 2
    website: www.google.com count: 5
    website: www.gmail.com count: 4
    website: www.indiabix.com count: 8
    website: www.yahoo.com count: 3
IP: 192.168.72.177
    website: www.yahoo.com count: 14
    website: www.google.com count: 3
    website: www.facebook.com count: 3
    website: www.m4maths.com count: 3
    website: www.indiabix.com count: 1
IP: 192.168.198.92
    website: www.facebook.com count: 4
    website: www.m4maths.com count: 3
    website: www.yahoo.com count: 3
    website: www.askubuntu.com count: 2
    website: www.indiabix.com count: 1
    website: www.google.com count: 5
    website: www.gmail.com count: 1

Upvotes: 1
