Reputation: 187
I have been trying to load a file from HDFS and check the result using DUMP, but I am not getting the desired output. My input file ('/results') looks like this:
1 fail
2 fail
3 pass
4 pass
5 fail
6 pass
7 fail
8 pass
9 pass
10 pass
11 pass
12 fail
13 fail
14 fail
15 pass
16 pass
17 pass
18 pass
19 pass
20 fail
And this is the Pig script I am running:
A = LOAD '/results' using PigStorage() as (f1:int, f2:chararray);
Dump A;
But I am getting the output as follows:
(1,fail)
(,)
(2,fail)
(,)
(3,pass)
(,)
(4,pass)
(,)
(5,fail)
(,)
(6,pass )
(,)
(7,fail)
(,)
(8,pass)
(,)
(9,pass)
(,)
(10,pass)
(,)
(11,pass)
(,)
(12,fail)
(,)
(13,fail)
(,)
(14,fail)
(,)
(15,pass)
(,)
(16,pass)
(,)
(17,pass)
(,)
(18,pass)
(,)
(19,pass)
(,)
(20,fail)
I really don't understand where the "(,)" between every two tuples comes from. Can someone help me out?
Thanks.
Upvotes: 0
Views: 103
Reputation: 16
You have to specify the correct delimiter in PigStorage() in order to read the file contents correctly. With no argument, PigStorage() splits on tabs, so pass the delimiter that your input data actually uses, for example:
For single space:
A = LOAD '/results' USING PigStorage(' ') AS (f1:int, f2:chararray);
DUMP A;
For tab delimited:
A = LOAD '/results' USING PigStorage('\t') AS (f1:int, f2:chararray);
DUMP A;
As for the (,) in the output: there appears to be an empty line between every two lines in your input data, and each empty line is loaded as a tuple whose fields are all null.
Solution:
Filter out the null records (assuming the delimiter is a tab):
A = LOAD '/results' USING PigStorage('\t') AS (f1:int, f2:chararray);
B = FILTER A BY f2 IS NOT NULL;
DUMP B;
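A slightly more defensive variant of the filter above, offered only as a sketch: it assumes the file is tab-delimited (which the correctly parsed rows in the question suggest), drops any row where either field is null, and trims stray whitespace such as the trailing space visible in "(6,pass )". The relation names results, non_blank and cleaned are illustrative choices, not anything required by Pig.

```pig
-- Assumption: '/results' is tab-delimited, as in the question.
results = LOAD '/results' USING PigStorage('\t') AS (f1:int, f2:chararray);

-- Blank lines load as (,), i.e. both fields null, so drop those rows.
non_blank = FILTER results BY f1 IS NOT NULL AND f2 IS NOT NULL;

-- TRIM removes leading/trailing whitespace from the status column.
cleaned = FOREACH non_blank GENERATE f1, TRIM(f2) AS f2;

DUMP cleaned;
```

If the blank lines are not meant to be there, it may be simpler to fix the file itself before uploading it to HDFS.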
Thank You.
Upvotes: 0
Reputation: 11080
You have to specify the delimiter that separates the columns of your input file in PigStorage. Assuming your columns are separated by a single space:
A = LOAD '/results' USING PigStorage(' ') as (f1:int, f2:chararray);
DUMP A;
If it is a tab:
A = LOAD '/results' USING PigStorage('\t') as (f1:int, f2:chararray);
DUMP A;
Upvotes: 0