Apache Pig: How to load a sequence file which is stored in hdfs?

Question

My sequence files are stored directly in HDFS, e.g.:

grunt> ls   
grunt> ls /blabla
hdfs://namenode1:54310/blabla/0411f03a-db7f-48d0-9542-5203304e3e81.seq 185284523
hdfs://namenode1:54310/blabla/05be8fc0-e967-42e1-b76a-0d7108a69d17.seq 201489688
hdfs://namenode1:54310/blabla/06222427-519c-49c0-bbbf-49a9f43bbd13.seq 196858576
hdfs://namenode1:54310/blabla/066da26a-48da-45b1-83f5-60d16475e40d.seq 194832641
hdfs://namenode1:54310/blabla/07cbfc83-42a2-47bf-b364-d39da3a2d071.seq 194806047
hdfs://namenode1:54310/blabla/10dea7b8-9ed3-4e66-b4bd-a3c07d8bf39e.seq 166224702

How can I write a Pig script that reads every file in the directory "blabla" and performs an action on it?

I've tried several ways of specifying the input, but none of them worked:

%default INPUT '/blabla/f8fbbe9a-aae3-413f-b3b9-37cdef71da8f.seq'
%default INPUT 'hdfs://namenode1:54310/blabla/f8fbbe9a-aae3-413f-b3b9-37cdef71da8f.seq'
%default INPUT 'f8fbbe9a-aae3-413f-b3b9-37cdef71da8f.seq'

I always get the error:

Input(s):

Failed to read data from "hdfs://namenode1:54310/........."

AntonyBrd · Accepted Answer

Did you try a glob that matches every file in the directory?

%default INPUT 'hdfs://namenode1:54310/blabla/*'

This should work if your .seq files are readable. It looks like they are not, because each of your attempts should have loaded a single file. Could you post the complete log line?

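You may also need to use the SequenceFileLoader from Pig's piggybank (org.apache.pig.piggybank.storage.SequenceFileLoader), since the default loader, PigStorage, expects delimited text rather than sequence files.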
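
A minimal sketch of how the glob and SequenceFileLoader could be combined. It is untested against your cluster. The piggybank jar path is a placeholder, and the key/value types in the AS clause are assumptions: they have to match the Writable types actually stored in your files.

```pig
-- Assumption: piggybank.jar is available locally; replace this placeholder path.
REGISTER /path/to/piggybank.jar;

-- Glob over every .seq file in the directory from the question.
%default INPUT 'hdfs://namenode1:54310/blabla/*.seq'

-- SequenceFileLoader emits one (key, value) tuple per record.
-- Assumption: both key and value are Text; adjust the types if yours differ.
seq = LOAD '$INPUT'
      USING org.apache.pig.piggybank.storage.SequenceFileLoader()
      AS (key:chararray, value:chararray);

-- Print a few records to confirm the files are readable before doing real work.
sample_rows = LIMIT seq 10;
DUMP sample_rows;
```

If the DUMP fails, the full stack trace in the Pig log file should say whether the problem is the path, the file permissions, or an unsupported key/value type.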
