Reputation: 1
I have a file composed as follows:
&009:65
34KKll90JJKK87LLOO
%(..)?.I$£.....
&013:35
36KKll90TTYY87LLPP
%%(.9)?'
&025:66
55KKll88ZZYY87MMQQ
%&(.9)?%%??-_'
And I would like to get a file as:
&009:65 34KKll90JJKK87LLOO %(..)?.I$£.....
&013:35 36KKll90TTYY87LLPP %%(.9)?'.......
&025:66 55KKll88ZZYY87MMQQ %&(.9)?%%??-_'.......
I use Hortonworks, and I would like to know whether Hive or Pig is better suited for this, and how I could achieve it with one or the other.
Upvotes: 0
Views: 74
Reputation: 9067
Hive, Pig, and the whole Hadoop ecosystem expect files with single-line records, so that you can split the file arbitrarily on any line break and process the splits separately with an arbitrary number of Mappers.
Your example has logical records spanning multiple lines. Not splittable, so it cannot easily be processed in a distributed way. Game over.
Workaround: start a shell somewhere, download the ugly stuff locally, rebuild consistent records with good old sed or awk utilities, and upload the result. Then you can read it with Hive or Pig.
Sample sed command line (awk would be overkill IMHO)...
sed -n '/^&/ { N ; N ; s/\n/ /g ; p }' UglyStuff.dump > NiceStuff.txt
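How it works: -n suppresses automatic printing; on every line starting with &, the two N commands pull the next two lines into the pattern space, s/\n/ /g replaces the embedded newlines with spaces, and p prints the rebuilt record. On your sample, the first record comes out as:
&009:65 34KKll90JJKK87LLOO %(..)?.I$£.....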
If you prefer one-liners:
hdfs dfs -cat /some/path/UglyStuff.dump | sed -n '/^&/ { N ; N ; s/\n/ /g ; p }' | hdfs dfs -put -f - /different/path/NiceStuff.txt
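Once the single-line file is on HDFS, reading it from Hive is straightforward. A minimal sketch, assuming the three fields stay separated by single spaces; the table name raw_records and the column names are hypothetical:

CREATE EXTERNAL TABLE raw_records (
  header  STRING,   -- e.g. &009:65
  payload STRING,   -- e.g. 34KKll90JJKK87LLOO
  suffix  STRING    -- e.g. %(..)?.I$£.....
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/different/path';   -- the directory holding NiceStuff.txt

SELECT * FROM raw_records LIMIT 3;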
Upvotes: 1