Reputation: 59
My load function needs to be sensitive to the age of the files: I am only interested in files created in the last week, but the folder holds 30 days' worth of files.
I am relatively new to Pig and have seen custom loaders, but I haven't found an option to restrict which files are loaded.
Any help will be appreciated.
Thanks
Upvotes: 1
Views: 533
Reputation: 5801
Don't try to do this within Pig. Use parameter substitution inside a Bash script. If you are running Pig in local mode, you can use the find command to grab the files:
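#!/bin/bash
DIR=/path/to/directory/of/input/files
# List regular files modified in the last 7 days, join them with commas,
# and strip the trailing comma that tr leaves behind.
FILES=$(find "$DIR" -maxdepth 1 -type f -mtime -7 | tr '\n' ',' | sed 's/,$//')
pig -p input="{$FILES}" myscript.pig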
find $DIR locates all the files in $DIR.
-maxdepth 1 -type f ensures that you will only consider regular files in the directory you specify (no sub-directories).
-mtime -7 restricts the listing to files modified in the last 7 days.
tr '\n' ',' turns it into a comma-separated list.
Then, in myscript.pig, you would have a statement like data = LOAD '$input' AS (...);
If you are running Pig on a cluster, you will need to use hdfs dfs -ls
and do some parsing of the output to get the filenames.
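For the cluster case, here is a rough sketch of what that parsing could look like. It assumes the usual hdfs dfs -ls layout (permissions, replication, owner, group, size, date, time, path), paths without spaces, and GNU date for computing the cutoff. The function name recent_hdfs_files is just something I made up for illustration, and the cutoff is compared at day granularity, so it only approximates -mtime -7.

```shell
#!/bin/bash
# Hypothetical helper: reads `hdfs dfs -ls` output on stdin and prints a
# comma-separated list of regular files modified on or after the cutoff date.
recent_hdfs_files() {
  local cutoff="$1"   # cutoff date as YYYY-MM-DD
  # Keep lines whose permission string starts with "-" (regular files),
  # compare the date column as a string, and print the path column.
  awk -v cutoff="$cutoff" '$1 ~ /^-/ && $6 >= cutoff { print $8 }' | paste -sd, -
}

# Usage sketch (needs an HDFS client and GNU date; not run here):
# DIR=/path/to/directory/of/input/files
# CUTOFF=$(date -d '7 days ago' +%Y-%m-%d)
# pig -p input="{$(hdfs dfs -ls "$DIR" | recent_hdfs_files "$CUTOFF")}" myscript.pig
```

Because the dates are in YYYY-MM-DD form, a plain string comparison in awk orders them correctly, and paste joins the paths without leaving a trailing comma.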
Upvotes: 1