Reputation: 55
I'm really new to Apache Hadoop, but I want to learn how to use it to summarize my machine logs. The data isn't actually that big (a few GBs), and I could just parse it the usual way and wait a few hours, but I think learning Hadoop might be useful.
So, I have log entries with a format like below.
Location, Date, IP Address
e.g.
New York, 2011-11-30 10:50:59, 1.1.1.1
New York, 2011-11-30 10:51:01, 1.1.1.2
Chicago, 2011-11-30 10:52:02, 1.1.1.1
Los Angeles, 2011-11-30 10:53:04, 1.1.1.4
I want to aggregate it by location, then by month, and then by IP address. Below is a sketch of what I have in mind.
Location, Month, IP, Count
+ New York
| +-- November 2011
| | +---- 1.1.1.1 5
| | +---- 1.1.1.2 2
| | +---- 1.1.1.3 7
| +-- December 2011
| | +---- 1.1.1.1 6
| | +---- 1.1.1.4 6
| +-- January 2012
| +---- 1.1.1.1 10
+ Chicago
| +-- November 2011
| | +---- 1.1.1.1 20
| | +---- 1.1.1.2 1
| | +---- 1.1.1.3 10
(so on)
My questions are:
Can I do this using Hadoop, or is there a better way to do it?
What is the common way to do this using Hadoop?
Thanks in advance for any pointers to links, articles, or sample code.
Upvotes: 2
Views: 497
Reputation: 41448
Can I do this using Hadoop, or is there a better way to do it?
You can definitely use Hadoop for this. With only a few GBs it's probably not strictly necessary, but what you gain by doing it with Hadoop is easy scaling: say tomorrow you have to do the same thing on 500 GB, you would potentially have nothing to change in your code, just the hardware and configuration.
What is the common way to do this using Hadoop?
I don't think there's a "common way" so to speak; Hadoop is a framework encapsulating multiple projects, and you could do this with Map/Reduce, Hive, Pig, ...
I think your use case lends itself pretty well to Hive, since you want to do aggregations and have a structure that maps easily into tables, and if you are new to Hadoop you can lean on your familiarity with SQL. So here are some hints.
Upload these logs into HDFS. This is the first step regardless of how you want to do the processing: HDFS is a distributed file system, so your logs will be split into blocks across your cluster and replicated.
hadoop fs -put /path/to/your/log/directory /user/hive/warehouse/logs
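If you want to sanity-check that the files landed where you expect (the path above is just the example location, adjust it to wherever you put your logs), you can list the directory:
hadoop fs -ls /user/hive/warehouse/logs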
Create a table in Hive. You have to make it external, pointing at the location where you put your logs in HDFS (and specify the delimiter used in your files):
hive -e "create external table logs(location string, day string, ip string) row format delimited fields terminated by ',' location /user/hive/warehouse/logs"
Now you can run queries on your data! For your example, you could do the following:
hive -e "select location, month(day), ip, count(*) from logs group by location, month(day), ip order by location, month, ip"
Note that I'm calling MONTH() on the day column to extract the month part of the date for the aggregation; this is what Hive calls a UDF.
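Hive also has a YEAR() UDF, so if you need to keep years distinct (your example tree separates November 2011 from December 2011 and January 2012), a variant of the same query could look like:
hive -e "select location, year(day), month(day), ip, count(*) from logs group by location, year(day), month(day), ip order by location, year(day), month(day), ip"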
Even though you are writing SQL queries, this will generate Map/Reduce jobs under the hood that run on your cluster, so your job will scale with the size of your cluster.
I hope that makes sense. If you want more details on Hive, I'd point you to the Hive DDL documentation as well as the official project page.
Upvotes: 1