dreddy

Reputation: 463

Hadoop concatenate files recursively maintaining directory structure

I would like to merge the files of 2 directories under a single directory, while maintaining the directory structure.

I have directory 1 and directory 2, each with approximately 80 subdirectories, structured as shown below.

Directory 1 on HDFS:

Directory 2 on HDFS:

I want file11 of dir 1 and file26 of dir 2 to end up under a single directory, likewise file13 of dir 1 and file27 of dir 2, and so on. The destination directory is directory 1.

The files added from dir 2 to dir 1 should land in the subdirectory whose path matches.

Desired Output:

Any help is appreciated.

Upvotes: 0

Views: 903

Answers (2)

dreddy

Reputation: 463

I generate a unique file name for each file under directory 2 and copy it into the matching subdirectory under dir 1. Here is the script:

for file in $(hadoop fs -ls /user/hadoop/2/* | grep -o -e "/user/hadoop/2/.*") ; do
    # path layout is /user/hadoop/2/<subDir>/<fileName>,
    # so field 5 is the subdirectory and field 6 the file name
    subDir=$(echo $file | cut -d '/' -f 5)
    fileName=$(echo $file | cut -d '/' -f 6)
    # append a UUID so the copy never collides with an existing file
    uuid=$(uuidgen)
    newFileName="${fileName}_${uuid}"
    # copy into the matching subdirectory under directory 1
    hadoop fs -cp "$file" "/user/hadoop/1/$subDir/$newFileName"
done

Upvotes: 1

Ram Ghadiyaram

Reputation: 29155

Use the org.apache.hadoop.fs.FileUtil API.

You get a FileSystem instance with the API below:

 final FileSystem fs = FileSystem.get(conf);
  • copy

    public static boolean copy(FileSystem srcFS, Path[] srcs, FileSystem dstFS, Path dst, boolean deleteSource, boolean overwrite, Configuration conf) throws IOException

    This method copies files between FileSystems.

  • FileUtil.replaceFile(File src, File target) should also work.

    The documentation for this method says: "Move the src file to the name specified by target."

In either case, you need to list the common folders (e.g. /user/hadoop/2/abc/ and /user/hadoop/1/abc/), compare the paths after the third slash character, and, if they match, copy source to destination, or develop the logic according to your requirement (this I will leave to you :-)). A sketch of that matching-and-copy logic follows.
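Here is a minimal sketch of that idea, assuming the /user/hadoop/<n>/<subDir>/<file> layout from the question; the MirrorCopy class, the mirrorCopy method name, and the hard-coded paths are illustrative assumptions, not part of the Hadoop API:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MirrorCopy {

        /**
         * For every subdirectory of src that also exists under dst,
         * copy its files into the matching subdirectory of dst.
         */
        public static void mirrorCopy(final FileSystem fs, final Path src, final Path dst,
                final Configuration conf) throws IOException {
            for (final FileStatus status : fs.listStatus(src)) {
                if (!status.isDirectory()) {
                    continue; // only subdirectories are mirrored
                }
                // e.g. /user/hadoop/2/abc -> /user/hadoop/1/abc
                final Path target = new Path(dst, status.getPath().getName());
                if (!fs.exists(target)) {
                    continue; // no matching subdirectory under dst
                }
                final FileStatus[] files = fs.listStatus(status.getPath());
                if (files.length == 0) {
                    continue; // nothing to copy
                }
                final Path[] srcs = new Path[files.length];
                for (int i = 0; i < files.length; i++) {
                    srcs[i] = files[i].getPath();
                }
                // deleteSource=false keeps the originals under src
                FileUtil.copy(fs, srcs, fs, target, false, true, conf);
            }
        }

        public static void main(final String[] args) throws IOException {
            final Configuration conf = new Configuration();
            final FileSystem fs = FileSystem.get(conf);
            mirrorCopy(fs, new Path("/user/hadoop/2"), new Path("/user/hadoop/1"), conf);
        }
    }

Note that overwrite=true assumes the file names in the two trees never clash (file11 vs. file26 in the question); if they can, generate unique names first, as in the other answer.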

After copying to the desired target, you can verify the results on the fly with the example method below:

    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Method listFileStats.
     *
     * @param destination
     * @param fs
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static void listFileStats(final String destination, final FileSystem fs)
            throws FileNotFoundException, IOException {
        final FileStatus[] statuses = fs.listStatus(new Path(destination));
        for (final FileStatus status : statuses) {
            // LOG is an SLF4J logger; you can use any other logger
            LOG.info("--  status {}    ", status.toString());
        }
    }
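For example, after either copy you could check a merged subdirectory like this (the path is illustrative, and LOG is assumed to be an SLF4J Logger declared in the enclosing class):

    listFileStats("/user/hadoop/1/abc", fs);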

Upvotes: 0
