BeanBagKing

Reputation: 2093

Split tuple fields into further fields after LOAD

I have a bunch of syslog data that looks something like this...

Mon Jan 1 00:00:01 UTC 1970 ServerName debug crond[123456]: System message telling me something

I'm not sure it's visible in the formatting here, but there is a tab character on each side of the ServerName splitting the string. So loading it initially is pretty easy...

A = LOAD '/syslogfiles' USING PigStorage('\t') AS (
date:chararray,
host:chararray,
message:chararray);

So now I have a tuple with 3 fields. Here's the next part I'm having trouble with. This is pseudo-code since I can't seem to get it right. I feel like EXTRACT may be what I'm looking for, but it doesn't turn out right.

What I want to do is split each of those fields up further, so like

B = FOREACH A <split> date USING PigStorage(' ') AS (
day:chararray,
month:chararray,
numday:int,
time:chararray,
timezone:chararray,
year:int);

So now I would have a tuple with 8 fields, (day, month, numday, time, timezone, year, host, message)

I assume that whatever technique answers this question would also let me keep going, for example splitting the time on : or the message on some other delimiter.

Upvotes: 1


Answers (2)

Davis Broda

Reputation: 4125

The first method that comes to mind for a task like this is REGEX_EXTRACT(). Try something like this:

A = LOAD '/syslogfiles' USING PigStorage('\t') AS ( date:chararray, host:chararray, message:chararray);

B = FOREACH A GENERATE
    REGEX_EXTRACT(date, '^([A-Za-z]+) [A-Za-z]+ [0-9]+ [0-9:]+ [A-Za-z]+ [0-9]+$', 1) AS day:chararray,
    REGEX_EXTRACT(date, '^[A-Za-z]+ ([A-Za-z]+) [0-9]+ [0-9:]+ [A-Za-z]+ [0-9]+$', 1) AS month:chararray ...

Something like the above could probably work, although my regular expressions could probably be made simpler if I thought about them for longer.
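
If repeating the pattern once per field gets tedious, the REGEX_EXTRACT_ALL builtin might be a shorter route, since it returns every capture group as one tuple. The following is an untested sketch, assuming every date looks exactly like the sample line (single spaces, six parts); C is just a relation name picked for the example. It declares everything as chararray first and casts the numeric parts in a second step:

```pig
-- Sketch only: assumes each date looks like "Mon Jan 1 00:00:01 UTC 1970".
A = LOAD '/syslogfiles' USING PigStorage('\t') AS (
    date:chararray,
    host:chararray,
    message:chararray);

-- REGEX_EXTRACT_ALL returns one tuple holding every capture group;
-- FLATTEN promotes those groups to top-level fields.
B = FOREACH A GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(date,
        '^([A-Za-z]+) ([A-Za-z]+) ([0-9]+) ([0-9:]+) ([A-Za-z]+) ([0-9]+)$'))
        AS (day:chararray, month:chararray, numday:chararray,
            time:chararray, timezone:chararray, year:chararray),
    host,
    message;

-- Cast the numeric parts in a separate step.
C = FOREACH B GENERATE
    day, month, (int)numday AS numday, time, timezone, (int)year AS year,
    host, message;
```

The pattern has to match the whole date string, so it is worth checking how lines with a different layout come out before relying on this.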

Upvotes: 1

reo katoa

Reputation: 5801

You are looking for the STRSPLIT builtin UDF. This returns a tuple. It's basically a wrapper for Java's String.split(). If you provide the limit parameter, you will have a predictable length for your tuple, and then you can use FLATTEN to promote the fields to the top level:

B =
    FOREACH A
    GENERATE
        FLATTEN(STRSPLIT(date, ' ', 6)) AS (
            day:chararray,
            month:chararray,
            numday:int,
            time:chararray,
            timezone:chararray,
            year:int),
        host,
        message;

DESCRIBE B;
B: {day: chararray,month: chararray,numday: int,time: chararray,timezone: chararray,year: int,host: chararray,message: chararray}
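
The same pattern should carry over to the follow-up in the question about splitting the time on the colon. A minimal sketch, assuming the time always has the form HH:MM:SS; the names C, time_hour, time_minute and time_second are made up for this example:

```pig
-- Sketch only: B is the relation built above; the time_* field names
-- are hypothetical and assume a time of the form HH:MM:SS.
C = FOREACH B GENERATE
    day, month, numday,
    FLATTEN(STRSPLIT(time, ':', 3)) AS (
        time_hour:chararray,
        time_minute:chararray,
        time_second:chararray),
    timezone, year, host, message;
```

The parts are left as chararray here; they could be cast with (int) in a later FOREACH if numbers are needed.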

Upvotes: 3
