Reputation: 2093
I have a bunch of syslog data that looks something like this...
Mon Jan 1 00:00:01 UTC 1970 ServerName debug crond[123456]: System message telling me something
I'm not sure it's visible in the formatting here, but there is a tab character on each side of the ServerName splitting the string. So loading it initially is pretty easy...
A = LOAD '/syslogfiles' USING PigStorage('\t') AS (
date:chararray,
host:chararray,
message:chararray);
So now I have a tuple with 3 fields. Here's the next part I'm having trouble with. This is pseudo-code since I can't seem to get it right. I feel like EXTRACT may be what I'm looking for, but it doesn't turn out right.
What I want to do is split each of those fields up further, something like this:
B = FOREACH A <split> date USING PigStorage(' ') AS (
day:chararray,
month:chararray,
numday:int,
time:chararray,
timezone:chararray,
year:int);
So now I would have a tuple with 8 fields, (day, month, numday, time, timezone, year, host, message)
I assume that whatever technique answers this question could be applied again, for example to split the time on ':' or the message on some other delimiter.
Upvotes: 1
Views: 1130
Reputation: 4125
The first method that comes to mind for a task like this is REGEX_EXTRACT(). Try something like this:
A = LOAD '/syslogfiles' USING PigStorage('\t') AS ( date:chararray, host:chararray, message:chararray);
B = FOREACH A GENERATE REGEX_EXTRACT(date, '([A-Za-z]+) [A-Za-z]+ [0-9]+ [0-9]+:[0-9]+:[0-9]+ [A-Za-z]+ [0-9]+', 1) AS day:chararray, REGEX_EXTRACT(date, '[A-Za-z]+ ([A-Za-z]+) [0-9]+ [0-9]+:[0-9]+:[0-9]+ [A-Za-z]+ [0-9]+', 1) AS month:chararray ...
Something like the above could probably work, although my regular expressions could probably be made simpler if I thought about them for longer.
Upvotes: 1
Reputation: 5801
You are looking for the STRSPLIT builtin UDF. This returns a tuple. It's basically a wrapper for Java's String.split(). If you provide the limit parameter, you will have a predictable length for your tuple, and then you can use FLATTEN to promote the fields to the top level:
B = FOREACH A GENERATE
    FLATTEN(STRSPLIT(date, ' ', 6)) AS (
        day:chararray,
        month:chararray,
        numday:int,
        time:chararray,
        timezone:chararray,
        year:int),
    host,
    message;
DESCRIBE B;
B: {day: chararray,month: chararray,numday: int,time: chararray,timezone: chararray,year: int,host: chararray,message: chararray}
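Since STRSPLIT is described above as a wrapper for Java's String.split(), the effect of the limit argument can be checked in plain Java. This is only a sketch under that assumption: the class name SplitLimitDemo and the helper splitDate are made up for illustration, and the input is the sample date from the question.

```java
import java.util.Arrays;

class SplitLimitDemo {
    // Hypothetical helper: split a syslog-style date on single spaces
    // into at most 6 pieces, mirroring STRSPLIT(date, ' ', 6).
    static String[] splitDate(String date) {
        return date.split(" ", 6);
    }

    public static void main(String[] args) {
        String date = "Mon Jan 1 00:00:01 UTC 1970";

        // Six space-separated tokens, limit 6: every token gets its own slot.
        System.out.println(Arrays.toString(splitDate(date)));
        // prints [Mon, Jan, 1, 00:00:01, UTC, 1970]

        // A smaller limit leaves the unsplit remainder in the last element.
        System.out.println(Arrays.toString(date.split(" ", 3)));
        // prints [Mon, Jan, 1 00:00:01 UTC 1970]
    }
}
```

Note that in Java the limit is an upper bound on the number of pieces, not a guarantee: a date with fewer tokens yields a shorter array, and two consecutive spaces (as some date formats use to pad single-digit days) produce an empty string element. If STRSPLIT passes its arguments straight through, the same would apply to the tuple it returns, so it is worth checking a few real log lines before relying on the six-field schema.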
Upvotes: 3