GC Co

Reputation: 1

Pig or Hive for a file manipulation

I have a file composed as follows:

&009:65 

34KKll90JJKK87LLOO

%(..)?.I$£.....

&013:35

36KKll90TTYY87LLPP

%%(.9)?'


&025:66

55KKll88ZZYY87MMQQ

%&(.9)?%%??-_'

And I would like to get a file as:

&009:65 34KKll90JJKK87LLOO  %(..)?.I$£.....

&013:35 36KKll90TTYY87LLPP  %%(.9)?'.......

&025:66 55KKll88ZZYY87MMQQ  %&(.9)?%%??-_'.......

I use Hortonworks, and I would like to know whether it's better to use Hive or Pig for this, and how I could achieve it with one or the other.

Upvotes: 0

Views: 74

Answers (1)

Samson Scharfrichter

Reputation: 9067

Hive, Pig, and the whole Hadoop ecosystem expect files with single-line records, so that you can split the file arbitrarily on any line break and process the splits separately with an arbitrary number of Mappers.

Your example has logical records spanning multiple lines. That's not splittable, so it cannot be processed easily in a distributed way. Game over.

Workaround: start a shell somewhere, download the ugly stuff locally, rebuild consistent records with good old sed or awk utilities, and upload the result. Then you can read it with Hive or Pig.

Sample sed command line (awk would be overkill IMHO)...

sed -n '/^&/ { N ; N ; N ; N ; s/\n\n/ /g ; p }' UglyStuff.dump > NiceStuff.txt

If you prefer one-liners:

hdfs dfs -cat /some/path/UglyStuff.dump | sed -n '/^&/ { N ; N ; N ; N ; s/\n\n/ /g ; p }' | hdfs dfs -put -f - /different/path/NiceStuff.txt
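For reference, a quick local run of that sed filter on a sample in the question's layout (file names here are just placeholders), assuming GNU sed:

```shell
# Build a small sample in the same multi-line layout as the question:
# header line starting with '&', then data and symbol lines separated by blanks
cat > UglyStuff.dump <<'EOF'
&009:65

34KKll90JJKK87LLOO

%(..)?.I$£.....

&013:35

36KKll90TTYY87LLPP

%%(.9)?'
EOF

# Collect each '&' header plus the next 4 lines into the pattern space,
# then replace the blank-line separators with spaces and print
sed -n '/^&/ { N ; N ; N ; N ; s/\n\n/ /g ; p }' UglyStuff.dump
# &009:65 34KKll90JJKK87LLOO %(..)?.I$£.....
# &013:35 36KKll90TTYY87LLPP %%(.9)?'
```

Each `N` appends one more input line to the pattern space, so a record is rebuilt only when a line actually starts with `&`; stray blank lines between records are silently dropped thanks to `-n`.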

Upvotes: 1
