Bimal

Reputation: 17

Getting Cast error in Pig script when trying to dump or store

I am getting a cast error after joining two datasets in a Pig script. I am using HDP 2.2. The error is:

ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 0: java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.lang.String

The error occurs when I try to DUMP or STORE the joined relation. Please advise.

My script is as follows:

complaint= load 'file1' using PigStorage('|');
extracted = foreach complaint generate $13 as complainant_first_name:chararray, $14 as complainant_last_name:chararray, $16 as hic:chararray;
filtered_com = filter extracted by hic IS NOT NULL;

mbr= load 'file2' using PigStorage(',');
extracted = foreach mbr generate $11 as first_nm:chararray, $12 as last_nm:chararray, $24 as medcr_nbr:chararray;
filtered_mbr = filter extracted by medcr_nbr is not null;

joined = join filtered_com by hic, filtered_mbr by medcr_nbr;
describe joined;
store joined into 'com_mbr' using PigStorage(',');

Upvotes: 0

Views: 1474

Answers (2)

CodeReaper

Reputation: 387

The error you are seeing is:

*Caused by: java.lang.ClassCastException: org.apache.pig.data.DataByteArray incompatible with java.lang.String*

By default, when you load data into Pig without declaring a schema, every field is typed as bytearray. To perform any string operation on a field, you need to cast it to chararray.
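As a minimal sketch of that cast, assuming a pipe-delimited file with at least three fields (the aliases `raw` and `upper_hic` are illustrative, not from the question):

```pig
-- No AS clause on LOAD, so $0, $1, $2, ... are all bytearray.
raw = LOAD 'sofile1.txt' USING PigStorage('|');

-- UPPER expects a chararray, so cast the field explicitly before calling it.
upper_hic = FOREACH raw GENERATE UPPER((chararray)$2) AS hic;

-- DESCRIBE shows the schema, which lets you confirm the resulting type.
DESCRIBE upper_hic;
```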

You can get the output either by using an explicit cast to chararray in the FOREACH statement or by simply leaving the data as bytearray, as shown below:

complaint = LOAD 'sofile1.txt' USING PigStorage('|'); -- No schema is declared, so every field defaults to bytearray.
extracted = FOREACH complaint GENERATE $0 AS complaint_first_name, $1 AS complaint_last_name, $2 AS hic;
filtered_com = FILTER extracted BY hic IS NOT NULL;
mbr = LOAD 'sofile2.txt' USING PigStorage(',');
extracted = FOREACH mbr GENERATE $0 AS first_nm, $1 AS last_nm, $2 AS medcr_nbr;
filtered_mbr = FILTER extracted BY medcr_nbr IS NOT NULL;
joined_data = JOIN filtered_com BY hic, filtered_mbr BY medcr_nbr;
DESCRIBE joined_data;

DESCRIBE prints the schema of the join; to print the rows as well, run DUMP joined_data. Hope this helps.

Upvotes: 0

madhu

Reputation: 1170

We can declare the column data types when loading file1:

complaint = load 'file1' using PigStorage('|') as (col0:chararray, col1:chararray, .........);

or

We can cast the columns to the required data types in the foreach block:

extracted = foreach complaint generate (chararray)$13 as complainant_first_name:chararray,
(chararray)$14 as complainant_last_name:chararray, (chararray)$16 as hic:chararray;

The same can be done for file2 as well. Hope this helps!
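For file2, that might look like the sketch below. It assumes the field positions from the question ($11, $12, $24); the alias `extracted_mbr` is introduced here only to avoid reusing `extracted`.

```pig
mbr = LOAD 'file2' USING PigStorage(',');

-- Cast each bytearray field explicitly; the AS clause then only names the result.
extracted_mbr = FOREACH mbr GENERATE
    (chararray)$11 AS first_nm,
    (chararray)$12 AS last_nm,
    (chararray)$24 AS medcr_nbr;

-- Drop rows with no join key before joining against the complaint data.
filtered_mbr = FILTER extracted_mbr BY medcr_nbr IS NOT NULL;
```

With both sides cast this way, the join keys hic and medcr_nbr are both chararray when they reach the JOIN.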

Upvotes: 1
