user3072054

Reputation: 349

Split a file into 4 equal parts using Apache Pig

I want to split a file into 4 equal parts using Apache Pig. For example, if a file has 100 lines, the first 25 should go to the first output file, and so on, with the last 25 lines going to the fourth output file. Can someone help me achieve this? I am using Apache Pig because the file will contain millions of records, and the previous steps that generate the file to be split also use Pig.

Upvotes: 2

Views: 1903

Answers (4)

Arthur M

Reputation: 458

I did a bit of digging on this because it comes up in the Hortonworks sample exam for Hadoop. It doesn't seem to be well documented, but it's quite simple really: the PARALLEL clause sets the number of reducers, and each reducer writes one output file. In this example I was using the Country sample database offered for download on dev.mysql.com:

grunt> storeme = order data by $0 parallel 3;
grunt> store storeme into '/user/hive/countrysplit_parallel';

Then, if we have a look at the directory in HDFS:

[root@sandbox arthurs_stuff]# hadoop fs -ls /user/hive/countrysplit_parallel
Found 4 items
-rw-r--r--   3 hive hdfs          0 2016-04-08 10:19 /user/hive/countrysplit_parallel/_SUCCESS
-rw-r--r--   3 hive hdfs       3984 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00000
-rw-r--r--   3 hive hdfs       4614 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00001
-rw-r--r--   3 hive hdfs       4768 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00002
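
For the four files asked about in the question, the same approach would presumably use PARALLEL 4. This is a minimal sketch, assuming a hypothetical input path, delimiter, and output directory. Note that ORDER BY assigns rows to reducers by sampled key ranges, so the parts come out roughly, not exactly, equal in size (as the three file sizes above suggest).

```pig
-- Hypothetical input path and delimiter; adjust to your data.
data = LOAD '/user/hive/country.txt' USING PigStorage('\t');

-- PARALLEL 4 requests four reducers for the ORDER BY job,
-- so the STORE below should write part-r-00000 through part-r-00003.
storeme = ORDER data BY $0 PARALLEL 4;

-- Hypothetical output directory.
STORE storeme INTO '/user/hive/countrysplit_parallel4';
```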

Hope that helps.

Upvotes: 4

user3072054

Reputation: 349

My requirement changed a bit: I have to store only the first 25% of the data in one file and the rest in another file. Here is the Pig script that worked for me.

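-- Load the pipe-delimited input.
ip_file = LOAD 'input file' using PigStorage('|');
-- Prepend a rank column computed from the third field.
rank_file = RANK ip_file by $2;
-- Group everything together so the total row count can be attached to each row.
rank_group = GROUP rank_file ALL;
-- After FLATTEN, $0 is the total count and $1 is the rank.
with_max = FOREACH rank_group GENERATE COUNT(rank_file),FLATTEN(rank_file);
-- First quarter by rank, then everything else.
top_file = filter with_max by $1  <=  $0/4;
rest_file = filter with_max by $1  >  $0/4;
-- parallel 1 forces a single reducer, hence a single output file.
sort_top_file = order top_file by $1 parallel 1;
store sort_top_file into 'output file 1' using PigStorage('|');
store rest_file into 'output file 2' using PigStorage('|');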

Upvotes: 0

Pradeep Bhadani

Reputation: 4721

You can use some of the Pig features below to achieve your desired result.

  1. SPLIT operator: http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
  2. MultiStorage class: https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/piggybank/storage/MultiStorage.html
  3. Write a custom Pig store function: https://pig.apache.org/docs/r0.7.0/udf.html#Store+Functions

You will have to provide a split condition based on your data.
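
As an illustration of the first option, here is a minimal sketch that uses SPLIT on a row number produced by RANK, for the 100-line example from the question. The file names and the fixed boundaries are assumptions. RANK and the OTHERWISE branch need a newer Pig release than the r0.7.0 documentation linked above describes.

```pig
-- 'file' and the output names are placeholders.
A = LOAD 'file' USING PigStorage() AS (line:chararray);

-- RANK without a BY clause prepends a sequential row number named rank_A.
B = RANK A;

-- Route each row to exactly one relation based on its row number.
SPLIT B INTO
    part1 IF rank_A <= 25,
    part2 IF (rank_A > 25 AND rank_A <= 50),
    part3 IF (rank_A > 50 AND rank_A <= 75),
    part4 OTHERWISE;

STORE part1 INTO 'file1';
STORE part2 INTO 'file2';
STORE part3 INTO 'file3';
STORE part4 INTO 'file4';
```

Each stored row still carries the rank_A column; a FOREACH ... GENERATE step before each STORE can drop it if only the original line is wanted.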

Upvotes: 1

Vignesh I

Reputation: 2221

This should work, though there may be a better option.

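-- Load each record into one chararray field (assumes the lines contain no tabs, PigStorage's default delimiter).
A = LOAD 'file' using PigStorage() as (line:chararray);
-- RANK with no BY clause numbers the rows 1, 2, 3, ... in a new column rank_A.
B = RANK A;
-- The boundaries below are hard-coded for a 100-line file.
C = FILTER B BY rank_A >= 1 and rank_A <= 25;
D = FILTER B BY rank_A > 25 and rank_A <= 50;
E = FILTER B BY rank_A > 50 and rank_A <= 75;
F = FILTER B BY rank_A > 75 and rank_A <= 100;
store C into 'file1';
store D into 'file2';
store E into 'file3';
store F into 'file4';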
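
The hard-coded boundaries above only fit a 100-line file. Here is a sketch of one way to derive them from the actual row count, using Pig's scalar projection (cnt.total below) on a single-row relation. The relation names G and cnt and the output names are hypothetical.

```pig
A = LOAD 'file' USING PigStorage() AS (line:chararray);
B = RANK A;

-- Build a single-row relation holding the total number of rows.
G = GROUP B ALL;
cnt = FOREACH G GENERATE COUNT(B) AS total;

-- cnt.total is read as a scalar; integer division gives the quarter boundaries.
C = FILTER B BY rank_A <= cnt.total / 4;
D = FILTER B BY rank_A > cnt.total / 4 AND rank_A <= cnt.total / 2;
E = FILTER B BY rank_A > cnt.total / 2 AND rank_A <= 3 * cnt.total / 4;
F = FILTER B BY rank_A > 3 * cnt.total / 4;

STORE C INTO 'file1';
STORE D INTO 'file2';
STORE E INTO 'file3';
STORE F INTO 'file4';
```

When the row count is not a multiple of four, the integer division means the four parts differ in size by at most one row.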

Upvotes: 0
