Reputation: 349
I want to split a file into 4 equal parts using Apache Pig. For example, if a file has 100 lines, the first 25 should go to the 1st output file, and so on, with the last 25 lines going to the 4th output file. Can someone help me achieve this? I am using Apache Pig because the file will contain millions of records, and the previous steps that generate the file to be split already use Pig.
Upvotes: 2
Views: 1903
Reputation: 458
I did a bit of digging on this because it comes up in the Hortonworks sample exam for Hadoop. It doesn't seem to be well documented, but it's quite simple really. In this example I was using the Country sample database offered for download on dev.mysql.com:
grunt> storeme = order data by $0 parallel 3;
grunt> store storeme into '/user/hive/countrysplit_parallel';
Then, if we look at the directory in HDFS:
[root@sandbox arthurs_stuff]# hadoop fs -ls /user/hive/countrysplit_parallel
Found 4 items
-rw-r--r-- 3 hive hdfs 0 2016-04-08 10:19 /user/hive/countrysplit_parallel/_SUCCESS
-rw-r--r-- 3 hive hdfs 3984 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00000
-rw-r--r-- 3 hive hdfs 4614 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00001
-rw-r--r-- 3 hive hdfs 4768 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00002
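For the four files asked about in the question, the same idea can be sketched with PARALLEL 4. This is a minimal sketch, not something I ran for this post: the input and output paths and the '|' delimiter are placeholders I made up, and the part files will only be roughly equal in size, because ORDER BY decides the range boundaries from a sample of the data rather than from an exact row count.

```pig
-- Hypothetical paths and delimiter; adjust to your data.
data = LOAD '/path/to/input' USING PigStorage('|');

-- PARALLEL 4 asks for four reducers on the sort, so the
-- store below should write four part files in sorted order.
sorted = ORDER data BY $0 PARALLEL 4;

STORE sorted INTO '/path/to/output_parallel4' USING PigStorage('|');
```

Note the sizes in the listing above (3984, 4614 and 4768 bytes): the parts are close to each other but not identical, so this approach fits when "about a quarter each" is good enough.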
Hope that helps.
Upvotes: 4
Reputation: 349
My requirement changed a bit: I have to store only the first 25% of the data in one file and the rest in another. Here is the Pig script that worked for me.
ip_file = LOAD 'input file' using PigStorage('|');
rank_file = RANK ip_file by $2;
rank_group = GROUP rank_file ALL;
with_max = FOREACH rank_group GENERATE COUNT(rank_file),FLATTEN(rank_file);
top_file = filter with_max by $1 <= $0/4;
rest_file = filter with_max by $1 > $0/4;
sort_top_file = order top_file by $1 parallel 1;
store sort_top_file into 'output file 1' using PigStorage('|');
store rest_file into 'output file 2' using PigStorage('|');
Upvotes: 0
Reputation: 4721
You can use some of the Pig features below to achieve your desired result.
You have to provide some condition based on your data.
Upvotes: 1
Reputation: 2221
This could work, though there may be a better option.
A = LOAD 'file' using PigStorage() as (line:chararray);
B = RANK A;
C = FILTER B BY rank_A >= 1 and rank_A <= 25;
D = FILTER B BY rank_A > 25 and rank_A <= 50;
E = FILTER B BY rank_A > 50 and rank_A <= 75;
F = FILTER B BY rank_A > 75 and rank_A <= 100;
store C into 'file1';
store D into 'file2';
store E into 'file3';
store F into 'file4';
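The boundaries above are hard-coded for a 100-line file. A possible way to derive them from the actual row count is to combine RANK with the GROUP ALL / COUNT trick from the other answer and a single SPLIT. This is a sketch under assumptions: the alias names and the output directory names are made up for illustration, fields are referenced by position, and GROUP ALL sends every row through one reducer, which may be slow for millions of records.

```pig
-- Load each line as a single field; 'file' is the input from the script above.
lines  = LOAD 'file' USING PigStorage() AS (line:chararray);

-- RANK with no field list prepends a sequential row number starting at 1.
ranked = RANK lines;

-- Attach the total row count to every row:
-- $0 = total rows, $1 = rank, $2 = original line.
grouped    = GROUP ranked ALL;
with_total = FOREACH grouped GENERATE COUNT(ranked), FLATTEN(ranked);

-- Integer division gives the quarter boundaries
-- (25, 50 and 75 for a 100-line file).
SPLIT with_total INTO
    q1 IF $1 <= $0 / 4,
    q2 IF $1 > $0 / 4 AND $1 <= $0 / 2,
    q3 IF $1 > $0 / 2 AND $1 <= 3 * $0 / 4,
    q4 OTHERWISE;

-- Keep only the original line in each output.
out1 = FOREACH q1 GENERATE $2;
out2 = FOREACH q2 GENERATE $2;
out3 = FOREACH q3 GENERATE $2;
out4 = FOREACH q4 GENERATE $2;

STORE out1 INTO 'file1_by_count';
STORE out2 INTO 'file2_by_count';
STORE out3 INTO 'file3_by_count';
STORE out4 INTO 'file4_by_count';
```

When the row count is not divisible by 4, the parts differ by at most a row or so (for 10 rows this gives 2, 3, 2 and 3). Each output is a directory that may hold more than one part file, and the rows inside it are not guaranteed to stay in rank order unless you add an ORDER BY on the rank before dropping it.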
Upvotes: 0