krisdigitx

Reputation: 7136

hadoop suggestions on how to process logs

I need some suggestions on how to process infrastructure logs using Hadoop in Java rather than Pig, as I think Pig does not support regex filters while reading log files.

As an example, I have Cisco logs and web server logs, and I want to filter specific values line by line and feed them into Hadoop.

There are a couple of suggestions online, e.g. to first convert the logs to CSV format, but what if the log file is several GBs?

Is it possible to filter the lines at the "map" stage, i.e. the program reads lines from the file in HDFS, sends them to the mapper, and the mapper keeps only the matching ones?
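Sketching what I have in mind: a mapper that drops non-matching lines, so no CSV conversion step is needed. This is only an illustration, not a working setup; the class name and the regex are made up, and the Hadoop libraries are assumed to be on the classpath:

```java
import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits only the input lines that match the regex; everything else
// is discarded at the map stage. Class name and pattern are
// illustrative placeholders.
public class LogFilterMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    private static final Pattern CISCO_LINE =
            Pattern.compile(".*%SEC-6-IPACCESSLOGP.*"); // example pattern

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (CISCO_LINE.matcher(value.toString()).matches()) {
            context.write(value, NullWritable.get());
        }
    }
}
```

Is this the right approach, or is there a cleaner way?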

I need some suggestions on the best and cleanest way to do this.

Thanks.

Upvotes: 1

Views: 164

Answers (1)

Jagadish Talluri

Reputation: 688

You can do regex operations in Pig; internally, Pig uses Java's regex specification.

Please go through the following example:

    myfile = LOAD '999999-99999-2007' AS (a:chararray);
    filterfile = FILTER myfile BY a MATCHES '.*DAY+.*';
    selectfile = FOREACH filterfile GENERATE a, SIZE(a) AS size;
    STORE selectfile INTO '/home/jagadish/selectfile';

The file used in the example is 2.7 GB and contains 11 million lines, of which the regex matches about 450,000.
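If you still prefer the Java route, the filtering logic behind Pig's `MATCHES` is the same `java.util.regex` machinery you would put inside a mapper's `map()` method. A minimal, self-contained sketch (the class name and sample lines are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LineFilter {
    // Same pattern as the Pig MATCHES example above.
    private static final Pattern DAY = Pattern.compile(".*DAY+.*");

    // Keeps only the lines matching the pattern. In a real MapReduce
    // job, this check would run once per line inside map(), emitting
    // only the matches.
    public static List<String> filter(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (DAY.matcher(line).matches()) {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines =
                List.of("MONDAY log entry", "no match here", "TUESDAY event");
        System.out.println(filter(lines));
    }
}
```

Note that `Matcher.matches()` must consume the whole line, which is why the pattern is wrapped in `.*` on both sides, exactly as in the Pig script.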

I believe this answers your question; otherwise, please let me know.

Upvotes: 6
