Pedro Alves

Reputation: 1054

My Pig script is creating empty files in HDFS

I have this script:

--Insert a new column based on filename
Data = LOAD '/user/cloudera/Source_Data' using PigStorage('\t','-tagFile');

Data_Schema = FOREACH Data GENERATE
(chararray)$1 AS Date,
(chararray)$2 AS ID,
(chararray)$3 AS Interval,
(chararray)$4 AS Code,
(chararray)$5 AS S_In,
(chararray)$6 AS S_Out,
(chararray)$7 AS C_In,
(chararray)$8 AS C_Out,
(chararray)$9 AS Traffic;

--Split into different directories
SPLIT Data_Schema INTO Src1 IF (Date == '2016-06-25.txt'),
Src2 IF (Date == '2014-07-31.txt'),
Src3 IF (Date == '2016-01-01.txt');

STORE Src1 INTO '/user/cloudera/Source_DatA/2016-06-25' using PigStorage('\t');
STORE Src2 INTO '/user/cloudera/Source_Data/2014-07-31.txt' using PigStorage('\t');
STORE Src3 INTO '/user/cloudera/Source_Data/2016-01-01' using PigStorage('\t');

And here is an example of my original source data:

10000 1388530800000 39 8.600870350350515 13.86183926855984 1.7218329193014124 3.424444103320796 25.972920214509095

When I execute the script it runs successfully, but the output files in HDFS contain no data.

Note that I add a new column based on the file name. That's why I have one more column in the FOREACH statement.

Upvotes: 0

Views: 324

Answers (1)

54l3d

Reputation: 3973

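If your input files are named 2016-06-25.txt, 2014-07-31.txt and 2016-01-01.txt, then the column added by -tagFile is referenced by $0 and contains the file name. Your script reads Date from $1, which is a data field, so none of the SPLIT conditions match and the outputs are empty. Shift the positions by one, like this: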

Data_Schema = FOREACH Data GENERATE
(chararray)$0 AS Date,
(chararray)$1 AS ID,
...

Or simply specify the schema when loading the files and keep the rest as it is:

Data = LOAD '/user/cloudera/Source_Data' using PigStorage('\t','-tagFile') as (Date:chararray, ID:chararray, ...);
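
For reference, here is a rough sketch of how the first option might look when written out in full. It assumes every input line has 8 tab-separated data fields, as in the sample row from the question, so that with the file name in $0 the last column is $8. The alias First_Rows is made up for illustration; the DUMP is only there to check that the first field really is the file name before relying on it in SPLIT.

```pig
-- Sketch only, assuming 8 tab-separated data fields per line (as in the sample row).
-- With '-tagFile', PigStorage prepends the source file name as field $0.
Data = LOAD '/user/cloudera/Source_Data' USING PigStorage('\t', '-tagFile');

-- Optional check (First_Rows is a hypothetical alias): the first field of each
-- printed tuple should be a file name such as 2016-06-25.txt.
First_Rows = LIMIT Data 5;
DUMP First_Rows;

-- Same column names as in the question, shifted down by one position.
Data_Schema = FOREACH Data GENERATE
    (chararray)$0 AS Date,
    (chararray)$1 AS ID,
    (chararray)$2 AS Interval,
    (chararray)$3 AS Code,
    (chararray)$4 AS S_In,
    (chararray)$5 AS S_Out,
    (chararray)$6 AS C_In,
    (chararray)$7 AS C_Out,
    (chararray)$8 AS Traffic;

-- The SPLIT conditions from the question can now match the file names held in Date.
SPLIT Data_Schema INTO Src1 IF (Date == '2016-06-25.txt'),
    Src2 IF (Date == '2014-07-31.txt'),
    Src3 IF (Date == '2016-01-01.txt');
```

The STORE statements from the question can follow unchanged. If the DUMP shows something other than a file name in the first position, the column offsets above would need adjusting.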

Upvotes: 1
