balu

Reputation: 19

Merge huge number of files into one file by reading the files in ascending order

I want to merge a large number of files into a single file, and the merge should happen in ascending order of the file names. I have tried the command below and it works as intended, but there is one problem: after the merge, output.txt contains all the data on a single line, because each input file has only one line of data without a trailing newline.

Is there any way to merge each file's data into output.txt as a separate line, rather than concatenating everything onto a single line?

My list of files has the naming format of 9999_xyz_1.json, 9999_xyz_2.json, 9999_xyz_3.json, ....., 9999_xyz_12000.json.

Example:

$ cat 9999_xyz_1.json
abcdef
$ cat 9999_xyz_2.json
12345
$ cat 9999_xyz_3.json
Hello

Expected output.txt:

abcdef
12345
Hello

Actual output:

$ ls -d -1 -v  "$PWD/"9999_xyz_*.json | xargs cat
abcdef12345

EDIT:

Since my input file names won't contain any spaces or special characters like backslashes or quotes, I decided to use the command below, which works for me as expected.

find . -name '9999_xyz_*.json' -type f | sort -V | xargs awk 1 > output.txt
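For context on why this works (my understanding): `awk 1` uses `1` as an always-true pattern with the default action `{print}`, so every record is printed followed by the output record separator `"\n"`, which supplies the newline the input files are missing. A minimal demonstration, using throwaway file names of my own choosing:

```shell
# Demo only: the /tmp file names are made up for illustration.
# `awk 1` prints every input record followed by ORS ("\n"), so a file
# whose single line lacks a trailing newline still ends up on its own
# line in the merged output.
printf 'abcdef' > /tmp/merge_demo_1.json   # no trailing newline
printf '12345'  > /tmp/merge_demo_2.json   # no trailing newline
awk 1 /tmp/merge_demo_1.json /tmp/merge_demo_2.json
# abcdef
# 12345
```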

I also tried with a file name containing a space; below are the results with 2 different commands.

Example:

$ cat 9999_xyz_1.json
abcdef
$ cat 9999_ xyz_2.json      -- this file name contains a space
12345
$ cat 9999_xyz_3.json
Hello

Expected output.txt:

abcdef
12345
Hello

Command:

find . -name '9999_xyz_*.json' -print0 -type f | sort -V | xargs -0 awk 1 > output.txt

Output:

Successfully completed the merge as expected, but with an error at the end.

abcdef
12345
Hello

awk: cmd. line:1: fatal: cannot open file `
' for reading (No such file or directory)

Command:

Here I have used sort with the -zV options to avoid the error that occurred with the previous command.

find . -name '9999_xyz_*.json' -print0 -type f | sort -zV | xargs -0 awk 1 > output.txt

Output:

Command completed successfully, but the results are not as expected: the file name containing a space is sorted last. The expectation is that the file name with the space should be in second position after the sort.

abcdef
Hello
12345
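A variant I would try (a sketch, not tested against your exact tree): since sort -V compares the whole name, a space earlier in the name changes the order. Sorting numerically on the underscore-separated index field sidesteps that. The field number 3 assumes find prints paths like ./9999_xyz_N.json with exactly two underscores before the index, which also holds for the space-containing name; the looser '9999*.json' pattern is my assumption so the space variant is matched at all.

```shell
# Sketch: sort NUL-separated paths numerically on the third "_"-separated
# field ("N.json", whose leading digits are the index) instead of
# version-sorting the whole name. Assumes GNU find/sort and paths of the
# form ./9999_xyz_N.json (two underscores before N).
find . -type f -name '9999*.json' -print0 |
  sort -z -t_ -k3,3n |
  xargs -0 awk 1 > output.txt
# output.txt:
# abcdef
# 12345
# Hello
```

Note also that -type f should come before -print0: find evaluates its expression left to right, so an action placed first fires before the type test is applied.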

Upvotes: 1

Views: 324

Answers (2)

oguz ismail

Reputation: 50760

Don't parse the output of ls, use an array instead.

for fname in 9999_xyz_*.json; do
  index="${fname##*_}"      # strip everything up to the last "_"
  index="${index%.json}"    # strip the ".json" suffix
  files[index]="$fname"     # the numeric index keeps the array in order
done && awk 1 "${files[@]}" > output.txt
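To spell out the two parameter expansions for one sample name (an illustrative sketch, separate from the answer's pipeline):

```shell
# Illustrative only: how the numeric index is peeled out of a name.
fname='9999_xyz_12000.json'
index="${fname##*_}"    # remove longest prefix through the last "_" -> "12000.json"
index="${index%.json}"  # remove the ".json" suffix                  -> "12000"
echo "$index"
# 12000
```

Because bash expands "${files[@]}" in ascending subscript order, assigning each name at its numeric index sorts the list implicitly, with no ls, find, or sort involved.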

Another approach that relies on GNU extensions:

printf '%s\0' 9999_xyz_*.json | sort -zV | xargs -0 awk 1 > output.txt

Upvotes: 1

joanis

Reputation: 12229

I would approach this with a for loop, and use echo to add the missing newline after each file:

for x in `ls -v -1 -d "$PWD/"9999_xyz_*.json`; do
   cat "$x"   # print the file's single line (no trailing newline)
   echo       # add the missing newline
done > output.txt

Now, someone will invariably comment that you should never parse the output of ls, but I'm not sure how else to sort the files in the right order, so I kept your original ls command to enumerate the files, which worked according to your question.

EDIT

You can optimize this a lot by using awk 1 as @oguzismail did in his answer:

ls -d -1 -v  "$PWD/"9999_xyz_*.json | xargs awk 1 > output.txt

This solution finishes in 4 seconds on my machine with 12000 files, as in your question, while the for loop takes 13 minutes to run. The difference is that the for loop launches 12000 cat processes, while xargs needs only a handful of awk processes, which is a lot more efficient.

Note: if you want to upvote this, make sure to upvote @oguzismail's answer too, since using awk 1 is his idea. But his answer with printf and sort -V is safer, so you probably want to use that solution anyway.

Upvotes: 1
