Reputation: 56
As the result of a Hive query I am getting more than one output file (I used DISTRIBUTE BY ... SORT BY), and now I want to merge them into a single file, so I tried the hdfs dfs -getmerge command. Does -getmerge sort the files before concatenating them, or does it just concatenate?
Upvotes: 2
Views: 2202
Reputation: 143
Here is the documentation (for Hadoop 2.7.1): https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge
Basically, it (1) concatenates the files into one and (2) optionally adds a newline (-nl) at the end of each concatenated file.
Usage: $ hadoop fs -getmerge [-nl] <src> <localdst>
Upvotes: 0
Reputation: 444
public static boolean copyMerge(FileSystem srcFS, Path srcDir,
                                FileSystem dstFS, Path dstFile,
                                boolean deleteSource,
                                Configuration conf, String addString) throws IOException {
  dstFile = checkDest(srcDir.getName(), dstFS, dstFile, false);

  if (!srcFS.getFileStatus(srcDir).isDirectory())
    return false;

  OutputStream out = dstFS.create(dstFile);

  try {
    FileStatus contents[] = srcFS.listStatus(srcDir);
    Arrays.sort(contents);
    for (int i = 0; i < contents.length; i++) {
      if (contents[i].isFile()) {
        InputStream in = srcFS.open(contents[i].getPath());
        try {
          IOUtils.copyBytes(in, out, conf, false);
          if (addString != null)
            out.write(addString.getBytes("UTF-8"));
        } finally {
          in.close();
        }
      }
    }
  } finally {
    out.close();
  }

  if (deleteSource) {
    return srcFS.delete(srcDir, true);
  } else {
    return true;
  }
}
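It sorts the array of files before concatenating them. Arrays.sort(contents) uses the natural ordering of FileStatus, which compares by path, so the files are appended in ascending path order. Only the order of the files is affected; the records inside each file are copied as-is and are not re-sorted. Source: Hadoop 0.23.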
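To make the ordering concrete, here is a minimal sketch of the same sort-then-concatenate logic on the local filesystem, using only java.nio so it runs without a Hadoop cluster. The class LocalCopyMerge, the method mergeSorted, and the sample file names are hypothetical and chosen for illustration; this is not the Hadoop implementation, just an analogue of the copyMerge loop above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

public class LocalCopyMerge {

    // Hypothetical local analogue of copyMerge: concatenates the regular
    // files in srcDir into dstFile, visiting them in ascending path order.
    // addString (may be null) is appended after each file, like -nl.
    public static void mergeSorted(Path srcDir, Path dstFile, String addString)
            throws IOException {
        List<Path> contents = new ArrayList<>();
        try (Stream<Path> listing = Files.list(srcDir)) {
            listing.filter(Files::isRegularFile).forEach(contents::add);
        }
        // Orders the files by path; the bytes inside each file are untouched.
        Collections.sort(contents);

        try (OutputStream out = Files.newOutputStream(dstFile)) {
            for (Path p : contents) {
                Files.copy(p, out);
                if (addString != null) {
                    out.write(addString.getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("parts");
        // Created out of order on purpose; each file is sorted internally.
        Files.write(dir.resolve("000001_0"), "b\nd\n".getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("000000_0"), "a\nc\n".getBytes(StandardCharsets.UTF_8));

        Path merged = Files.createTempFile("merged", ".txt");
        mergeSorted(dir, merged, null);

        // Prints a, c, b, d on separate lines: the files are taken in name
        // order, but the merged records are not globally sorted.
        System.out.print(new String(Files.readAllBytes(merged), StandardCharsets.UTF_8));
    }
}
```

So if each reducer's output is sorted only within its own file (as with SORT BY), the merged file keeps that per-file order and is not a globally sorted result.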
Upvotes: 4