Anil Arora

Reputation: 59

Pig: load specific files from a folder

My load function should be sensitive to the age of the files: I am only interested in files created in the last week, and the folder holds 30 days' worth of files.

I am relatively new to Pig and have seen custom loaders, but I haven't found an option to restrict which files are loaded.

Any help will be appreciated

Thanks

Upvotes: 1

Views: 533

Answers (1)

reo katoa

Reputation: 5801

Don't try to do this within Pig. Use parameter substitution inside a Bash script. If you are running Pig in local mode, you can use the find command to grab the files:

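#!/bin/bash

DIR=/path/to/directory/of/input/files
# Pass the recent files to Pig as a glob of the form {file1,file2,...}
pig -p input="{$(find "$DIR" -maxdepth 1 -type f -mtime -7 | paste -sd, -)}" myscript.pig
  • find "$DIR" searches $DIR (quoted so that a path containing spaces survives).
  • -maxdepth 1 -type f ensures that you will only consider regular files in the directory you specify (no sub-directories).
  • -mtime -7 restricts the listing to files modified in the last 7 days.
  • paste -sd, - joins the lines into a comma-separated list. Unlike tr '\n' ',', it leaves no trailing comma, which would otherwise add an empty alternative to the glob.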

Then, in myscript.pig, you would have a statement like data = LOAD '$input' AS (...);

If you are running Pig on a cluster, you will need to use hdfs dfs -ls and do some parsing of the output to get the filenames.
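
The parsing step could look roughly like the sketch below. It assumes the usual eight-column hdfs dfs -ls listing (permissions, replication, owner, group, size, date, time, path) and paths without spaces. The function name recent_hdfs_files is made up for illustration, and the commented usage relies on GNU date. Note that it filters by calendar date, so it is slightly coarser than find's -mtime -7.

```shell
#!/bin/bash

# Hypothetical helper: reads `hdfs dfs -ls` output on stdin and prints a
# comma-separated list of regular files whose modification date is on or
# after the cutoff (YYYY-MM-DD) given as the first argument.
# Assumes columns: perms, replication, owner, group, size, date, time, path.
recent_hdfs_files() {
  local cutoff="$1"
  # Skip the "Found N items" header (too few fields) and directories
  # (permissions start with "d"); ISO dates compare correctly as strings.
  awk -v cutoff="$cutoff" '
    NF >= 8 && $1 ~ /^-/ && $6 >= cutoff { print $8 }
  ' | paste -sd, -
}

# Example usage (needs GNU date and a configured Hadoop client):
#   DIR=/path/to/directory/of/input/files
#   cutoff=$(date -d "7 days ago" +%F)
#   pig -p input="{$(hdfs dfs -ls "$DIR" | recent_hdfs_files "$cutoff")}" myscript.pig
```

As with the local-mode version, myscript.pig would still read the list through LOAD '$input'.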

Upvotes: 1
