El Capitan

Reputation: 105

Datetime parsing in Apache Pig

I'm trying to parse a date in a Pig script and I get the following error: "Hadoop does not return any error message".

Here is an example of the date format: 3/9/16 2:50 PM

And here is how I parse it:

data = LOAD 'cleaned.txt'
AS (Date, Block, Primary_Type, Description, Location_Description, Arrest, Domestic, District, Year);

times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') As Time;

You can see the data file here

Do you have any idea? Thanks.


EDIT:

It looks like the error is caused by the STORE command on "times".

If I do a DUMP instead, I get:

ERROR 1066: Unable to open iterator for alias times

It happens only when I use the ToDate function; my other scripts work perfectly.

Upvotes: 0

Views: 721

Answers (1)

kecso

Reputation: 2485

First of all, you should specify the loader in the LOAD statement:

USING PigStorage('\t')

I assumed that you're using a tab separator. Then, since your schema has no types, declare a type for each field.

So your LOAD statement will look something like this:
data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);

For now I just used chararray for every field, but you should give each field the type that represents it best.

After this, the date conversion works exactly as you wrote it:

(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)
(2016-03-09T23:55:00.000Z)

My test script:

data = LOAD 'SO/date2parse.txt' USING PigStorage('\t') AS (Date:chararray, Block:chararray, Primary_Type:chararray, Description:chararray, Location_Description:chararray, Arrest:chararray, Domestic:chararray, District:chararray, Year:chararray);
times = FOREACH data GENERATE ToDate(Date, 'M/d/yy h:mm a') As Time;
DUMP times;

UPDATE: Some explanation

By the way, the default loader is PigStorage:

PigStorage is the default load function for the LOAD operator.

but it's nicer to specify it explicitly. The original issue was caused by the missing data types:

If you don't assign types, fields default to type bytearray

so ToDate failed on the input type.
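
If you would rather not type the whole schema, a possible alternative is to leave the fields untyped and cast only the date field where it is used. This is a minimal sketch, not something tested against the question's data: it assumes the file is 'cleaned.txt' from the question and that it is tab-separated, and it relies on Pig allowing a cast from bytearray to chararray.

```pig
-- Assumption: 'cleaned.txt' is tab-separated, as guessed in the answer above.
data = LOAD 'cleaned.txt' USING PigStorage('\t')
    AS (Date, Block, Primary_Type, Description, Location_Description, Arrest, Domestic, District, Year);

-- With no declared types, DESCRIBE reports every field as bytearray.
DESCRIBE data;

-- Cast the field to chararray explicitly before handing it to ToDate.
times = FOREACH data GENERATE ToDate((chararray)Date, 'M/d/yy h:mm a') AS Time;
DUMP times;
```

Running DESCRIBE before the conversion is a quick way to confirm which type each field really has when ToDate or another typed function misbehaves.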

Upvotes: 2
