Reputation: 8449
I'm trying to move files (a directory tree) from the local file system to HDFS using the moveFromLocal HDFS shell command.
If the destination subdirectories don't exist, everything works fine. But if they do exist (which is the general case, as files are added to existing directories), another level is created in the hierarchy.
Example:
The original structure on disk
$ find src
src
src/a
src/a/2
src/a/2/file1
src/a/1
src/a/1/file1
src/a/4
src/a/4/file1
src/a/3
src/a/3/file1
src/b
src/b/2
src/b/2/file1
src/b/1
src/b/1/file1
src/b/4
src/b/4/file1
src/b/3
src/b/3/file1
The move command
$ hdfs dfs -moveFromLocal src/* /dst
The result (as expected)
$ hdfs dfs -ls -R /dst
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/a
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/a/1
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/a/1/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/a/2
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/a/2/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/a/3
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/a/3/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/a/4
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/a/4/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/b
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/b/1
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/b/1/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/b/2
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/b/2/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/b/3
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/b/3/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/b/4
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/b/4/file1
The local files in the 2nd batch
$ find src
src
src/a
src/a/2
src/a/2/file2
src/a/1
src/a/1/file2
src/a/4
src/a/4/file2
src/a/3
src/a/3/file2
src/b
src/b/2
src/b/2/file2
src/b/1
src/b/1/file2
src/b/4
src/b/4/file1
src/b/3
src/b/3/file2
Moving the 2nd batch to hdfs
$ hdfs dfs -moveFromLocal src/* /dst
The 2nd batch on hdfs
Note that all the files from the 2nd batch end up in a doubled hierarchy (a/a instead of just a, and b/b instead of just b).
$ hdfs dfs -ls -R /dst
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/a
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/a/1
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/a/1/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/a/2
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/a/2/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/a/3
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/a/3/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/a/4
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/a/4/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/a/a
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/a/a/1
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:42 /dst/a/a/1/file2
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/a/a/2
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:42 /dst/a/a/2/file2
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/a/a/3
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:42 /dst/a/a/3/file2
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/a/a/4
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:42 /dst/a/a/4/file2
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/b
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/b/1
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/b/1/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/b/2
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/b/2/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/b/3
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/b/3/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:39 /dst/b/4
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:39 /dst/b/4/file1
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/b/b
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/b/b/1
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:42 /dst/b/b/1/file2
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/b/b/2
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:42 /dst/b/b/2/file2
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/b/b/3
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:42 /dst/b/b/3/file2
drwxr-xr-x - root supergroup 0 2014-02-02 03:42 /dst/b/b/4
-rw-r--r-- 3 root supergroup 0 2014-02-02 03:42 /dst/b/b/4/file1
EDIT
I understand that this behavior is by design. I'm open to alternative solutions that achieve the same result.
The easiest solution is a loop that moves each file separately, but this is problematic for performance (each hdfs command launches a new JVM).
I also considered using copy instead of move, but then I need an efficient and safe way to delete the files that were actually copied.
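One possible middle ground between a single bulk move and a per-file loop is to issue one moveFromLocal per local directory, passing all of that directory's files at once and naming the matching HDFS directory as the destination. The sketch below only prints the commands (a dry run). The function name plan_moves is made up for illustration, and it assumes that file names contain no whitespace, that find supports -maxdepth, and that the installed HDFS shell accepts -mkdir -p with several paths.

```shell
# plan_moves SRC DST (hypothetical helper): print the hdfs commands that
# would merge the files under local SRC into the HDFS directory DST.
# Assumption: file names contain no whitespace or shell metacharacters.
plan_moves() {
    src=${1%/}
    dst=${2%/}
    # Local directories that directly contain at least one regular file.
    dirs=$(find "$src" -type f | sed 's|/[^/]*$||' | sort -u)
    [ -n "$dirs" ] || return 0
    # Matching HDFS directories, created by a single mkdir call so that
    # new and already existing targets are handled the same way.
    targets=$(printf '%s\n' "$dirs" | while read -r dir; do
        rel=${dir#"$src"}
        printf '%s%s ' "$dst" "$rel"
    done)
    printf 'hdfs dfs -mkdir -p %s\n' "${targets% }"
    # One moveFromLocal per directory: only files are passed as sources,
    # so no directory is ever nested inside an existing one.
    printf '%s\n' "$dirs" | while read -r dir; do
        rel=${dir#"$src"}
        files=$(find "$dir" -maxdepth 1 -type f | sort | tr '\n' ' ')
        printf 'hdfs dfs -moveFromLocal %s%s%s\n' "$files" "$dst" "$rel"
    done
}
```

Calling plan_moves src /dst prints the plan for review, and piping that output to sh would run it. This still starts one JVM per directory plus one for the mkdir, rather than one per file, and it leaves the emptied local directories in place.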
Upvotes: 1
Views: 736
Reputation: 1399
This behavior is consistent (kind of) with mv on Unix. Although its man page doesn't document it, mv refuses to rename a directory onto an existing directory that is not empty:
[evgeny@dev1]$ mv src/* dst/
mv: cannot move 'src/subsrc' to 'dst/subsrc': Directory not empty
Unfortunately, you have to clean the dst directory first: "hadoop fs -rmr dst" (on newer releases -rmr is deprecated in favor of "hadoop fs -rm -r dst").
Upvotes: 1
Reputation: 8449
org.apache.hadoop.fs.FileContext (a wrapper/replacement for org.apache.hadoop.fs.FileSystem) has a cleaner API.
Among other things, its rename will (optionally) fail if the destination directory exists. This does not perform the requested merge, but at least it raises an exception and won't create unwanted subdirectories.
Upvotes: 0