Don P

Reputation: 63567

Combine several files, with a separator, into one file

I have several (~300,000) files, each containing an individual JSON object, that I want to combine into a single file holding one JSON array. How can I do this on Linux, assuming they are all in the location "~/data_files"?

FileA

{
  name: "Test",
  age: 23
}

FileB

{
  name: "Foo",
  age: 5
}

FileC

{
  name: "Bar",
  age: 5
}

Example Output: (begins and ends with brackets, and added commas between objects)

[
    {
      name: "Test",
      age: 23
    },
    {
      name: "Foo",
      age: 5
    },
    {
      name: "Bar",
      age: 5
    }
]

What I've tried:

I know I can use cat to combine a bunch of files, but I haven't yet worked out how to do it for every file in a directory. I'm also trying to work out how to insert the , between the files I'm concatenating; I haven't seen a command for that yet.

Upvotes: 7

Views: 4039

Answers (5)

christian elsee

Reputation: 121

Use jq; it is, or should be, best practice at this point.

$ cat <<eof | jq -s .
> { "key": 1 }
> { "key2": 2 }
> { "key3": 3 }
> eof
[
  {
    "key": 1
  },
  {
    "key2": 2
  },
  {
    "key3": 3
  }
]

If your requirement is JUST to push JSON objects into a queue, any other suggestion is naive at best, and that is not a matter of opinion.
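
To apply the same idea to a directory instead of a here-document, the files can be streamed into jq. This is only a minimal sketch under a few assumptions: jq is installed, each file holds valid JSON (keys in double quotes, unlike the samples in the question, which jq would reject), and combine_with_jq is a hypothetical helper name. The files go through find ... -exec cat {} + rather than being listed on jq's command line, because ~300,000 names could exceed the argument-length limit.

```shell
#!/bin/sh
# Hypothetical helper: combine every regular file in directory $1
# into a single JSON array written to file $2.
combine_with_jq() {
    # find passes the file names to cat in batches, so no single
    # command line grows too long; jq -s (slurp) then collects all
    # the concatenated objects into one array.
    find "$1" -maxdepth 1 -type f -exec cat {} + | jq -s . > "$2"
}

# Example call (both paths are placeholders):
# combine_with_jq ~/data_files ~/combined.json
```

Keep the output file outside the input directory so it is not picked up as input. Note that -s reads the whole input into memory before printing the array.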

Upvotes: 0

Niall Cosgrove

Reputation: 1303

Since you seem a little new to unix I'll try to give you a solution that is simple and doesn't introduce too many new concepts. I'll leave clever and novel to the other posters. This solution will be very efficient since all I'm doing is streaming files into files.

To start with we will create a new file in our home directory with a square bracket in it.
echo "[" > ~/tmp.json

Now we loop through all the files in your data_files directory and append them to our new file. The >> will add them to what's already there. If you used a > then the file would get overwritten each time. The echo will add a comma when the cat has finished outputting the file. The quotes around "$i" keep file names containing spaces in one piece.
for i in ~/data_files/*; do cat "$i"; echo ","; done >> ~/tmp.json

So now we have your 300k files in one file called tmp.json, with each entry separated by a comma, but the last line of the file is also a comma and that is not what we want.
The sed command below behaves like cat except that '$d' tells it to omit the last line of the file.
So we create a new file with all but the last line of our temporary file.
sed '$d' ~/tmp.json > ~/finished.json

We need to close our square bracket
echo "]" >> ~/finished.json

And finally we delete our temporary file.
rm ~/tmp.json

And we are done.

[
{
    name: "Test",
    age: 23
}
,
{
    name: "Foo",
    age: 5
}
,
{
    name: "Bar",
    age: 5
}
]

A quick glance at this post about pretty printing json will point you at a command line tool that will take your finished.json file and turn it into exactly the output you asked for.
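
For reference, the steps above can also be folded into a single pass by writing the comma before every file except the first, so there is no trailing comma to strip afterwards. This is just a sketch, not a replacement for the walkthrough; combine_json is a hypothetical helper name, and the two arguments (input directory, output file) are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write the contents of every file in directory $1
# to file $2 as one bracketed, comma-separated list.
combine_json() {
    {
        echo "["
        first=1
        for f in "$1"/*; do
            # Skip subdirectories, and the literal pattern if nothing matched.
            [ -f "$f" ] || continue
            if [ "$first" -eq 1 ]; then
                first=0
            else
                # The separator goes before every file except the first.
                echo ","
            fi
            cat "$f"
        done
        echo "]"
    } > "$2"
}
```

Called as combine_json ~/data_files ~/finished.json, this should give the same layout as the output shown above, without the temporary file or the sed step. The output file should live outside the input directory so the glob does not match it.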

Upvotes: 11

Vadim Key

Reputation: 1234

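And a Python version for completeness (Python 3; the comma is written before every file except the first, so none trails the last object):

import os
import sys

# Directory holding the JSON files, e.g. ~/data_files
data_dir = sys.argv[1]

print("[")
first = True
for fn in os.listdir(data_dir):
    # Separate entries with a comma, but never after the last one.
    if not first:
        print(",")
    first = False
    with open(os.path.join(data_dir, fn), "r") as f:
        print(f.read(), end="")
print("]")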

Upvotes: 1

karakfa

Reputation: 67467

A simple for loop and a couple of sed commands will do:

$ echo "[" > all; 
  for f in file{A,B,C}; 
  do 
     sed 's/^/\t/;$s/$/,/' "$f" >> all; 
  done; 
  sed -i '$s/,/\n]/' all

$ cat all
[
 {
   name: "Test",
   age: 23
 },
 {
   name: "Foo",
   age: 5
 },
 {
   name: "Bar",
   age: 5
 }
]

or the same to stdout

$ echo "["; for f in file{A,B,C}; do sed 's/^/\t/;$s/$/,/' "$f"; done |
sed '$s/,/\n]/'

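To run this for all files in the directory, change file{A,B,C} to *. With the first version, write all somewhere outside that directory (or run the loop from another directory with a path such as ~/data_files/*), otherwise * will also match the output file itself.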

Upvotes: 2

Andrey

Reputation: 2583

This script should work even if the number of files is 300K+. It should also be faster than the sed solution, since the input files are copied through unchanged rather than edited line by line.

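#!/bin/sh
# List the input files once; /dev/shm keeps the list in memory (Linux).
tmp="/dev/shm/${USER}.find.tmp"
out='all.json'
find . -maxdepth 1 -name 'file*' > "${tmp}"
echo '[' > "${out}"
# Every file except the last is followed by a comma (head -n -1 is GNU-specific).
head -n -1 "${tmp}" | while IFS= read -r f
do
  cat "${f}" >> "${out}"
  echo ',' >> "${out}"
done
# The last file gets no trailing comma.
f=$(tail -n 1 "${tmp}")
cat "${f}" >> "${out}"
echo ']' >> "${out}"
rm -f -- "${tmp}"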

Upvotes: 1
