Havnar

Reputation: 2628

design pattern for distcp directories with wildcards or variables (glob)

I'm aware that distcp is not able to use wildcards. However, I need to do a scheduled distcp on changing directories (i.e. copy only the data in the "friday" dir on Monday, etc.), and also from all projects under a specified dir.

Is there some sort of design pattern for scripting this kind of thing?

So in short, I want to be able to do:

    hadoop distcp /foo/*/bar/$year/$month/$day hdfs://namespace-foo/replication-dir/

Upvotes: 0

Views: 813

Answers (1)

Havnar

Reputation: 2628

I ended up using the following function to get to the directories I need.

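    # Print the entries found under the HDFS path (or glob) given as $1, one per line.
    function get_list_of_directories_for_input_dir {

        # Column 8 of `hadoop fs -ls` output is the path; the "Found N items" header yields an empty line.
        local fvar_dirlist=`hadoop fs -ls "$1" | awk '{print $8}'`
        local fvar_count=`echo "$fvar_dirlist" | wc -l`
        if [ "$fvar_count" -ge "2" ]; then

                local fvar_len=$(($fvar_count - 1))
                # Quote the variable so the newlines survive; tail then drops the first (header) line.
                fvar_dirlist=`echo "$fvar_dirlist" | tail -n $fvar_len`
                echo "$fvar_dirlist"

        else
                # return instead of exit, so a failed listing does not terminate the calling script
                return 1;
        fi

    }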
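For completeness, here is a rough sketch of how such a listing could be fed into distcp. The names `paths_from_ls_output` and `distcp_day` are made up for illustration. It assumes that `hadoop fs -ls` expands a quoted glob itself, that `-d` lists the matched directories rather than their contents, and that distcp accepts several source paths before the target. Paths containing whitespace are not handled.

```shell
# Read `hadoop fs -ls` output on stdin and print only the path column.
# Header lines such as "Found N items" have fewer than 8 fields and are skipped.
paths_from_ls_output() {
    awk 'NF >= 8 {print $8}'
}

# Copy every /foo/<project>/bar/<date_path> directory to the target.
# $1 = date path such as 2016/05/20 (hypothetical layout from the question)
# $2 = distcp target URI
distcp_day() {
    local date_path
    local target
    local sources
    date_path="$1"
    target="$2"

    # The glob is quoted so that Hadoop expands it, not the local shell.
    sources=$(hadoop fs -ls -d "/foo/*/bar/${date_path}" | paths_from_ls_output)
    if [ -z "$sources" ]; then
        echo "no source directories found for ${date_path}" >&2
        return 1
    fi

    # Word splitting on $sources is intentional: one argument per directory.
    hadoop distcp $sources "$target"
}
```

From a Monday cron job, something like `distcp_day "$(date -d '3 days ago' +%Y/%m/%d)" hdfs://namespace-foo/replication-dir/` would pick up Friday's directories. Note that `date -d` is GNU-specific.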

Upvotes: 1
