Sebastian

Reputation: 315

Split sorted file without cutting blocks

I have a large file (20 GB) consisting of 30 million data records. The first field of each record is a non-unique key. The file is sorted by this key. I'd like to split this file into chunks, using anything available in a bash shell, such that the chunks are approximately the same size in bytes and all records with the same key go into the same chunk in the same order as in the original file.

I obviously do not want awk -F";" '{print > $1}' theFile because I'd prefer on the order of 10 large chunks, not one file per key. Also, split alone won't cut it, because I need a way to keep identical keys together.

Upvotes: 0

Views: 212

Answers (2)

Socowi

Reputation: 27215

You can pre-process the file such that split knows where it is allowed to split.

Here, we insert a null byte \0 at each position where splitting is allowed. Afterwards, we remove all \0 from the generated files. This assumes that your original data never contains \0.

# insert a null byte before each new key (except the first) to mark an allowed split point
awk -F\; '$1!=last {last=$1; if(NR>1) printf "\0"} 1' file > file.tmp
# split at the null bytes into 10 chunks, stripping the markers from each output file
split -t\\0 --filter 'tr -d \\0 > "$FILE"' -n l/10 file.tmp file.
rm file.tmp

You can adapt split's options to your liking. Here we split into 10 chunks. Due to our pre-processing and the changed delimiter -t\\0, the chunk option l/… keeps identical keys together.
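If you would rather cap the chunks by size than fix their count, GNU split's -C (--line-bytes) option should combine with the same pre-processing; a sketch, with 2G as an arbitrary size limit:

split -t\\0 --filter 'tr -d \\0 > "$FILE"' -C 2G file.tmp file.

Each output file then holds at most 2 GiB of complete null-delimited records.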

To verify that everything worked, you can run

for i in file.*; do
  echo "--- $i ---"
  head -n1 "$i"
  echo "[...]"
  tail -n1 "$i"
done
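For a stricter check, you could compare the boundary keys of consecutive chunks; a minimal sketch, assuming the ; delimiter from the question:

prev=
for i in file.*; do
  first=$(head -n1 "$i" | cut -d\; -f1)
  # a chunk must not start with the key the previous chunk ended on
  [ -n "$prev" ] && [ "$first" = "$prev" ] && echo "key $first spans two chunks!"
  prev=$(tail -n1 "$i" | cut -d\; -f1)
done

If nothing is printed, no key block was cut in half.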

I generated a test file using

n=1""000""000   # one million; the empty "" pairs only group the digits for readability
paste -d\; <(shuf -i1-500 -rn$n) <(shuf -rn$n /usr/share/dict/words) | sort -t\; -k1,1n > file

and got

--- file.aa ---
1;abbesses
[...]
55;Zoroaster
--- file.ab ---
56;abase
[...]
107;zoologists
--- file.ac ---
108;abattoir
[...]

and so on.

Upvotes: 2

Dominique

Reputation: 17493

You mention split as a tag, but are you aware that the UNIX/Linux command split exists specifically for this purpose?

The man page mentions (among other options):

-b, --bytes=SIZE
    put SIZE bytes per output file

There are plenty of examples all over the internet.
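For instance, to cut the file into roughly 2 GB pieces (theFile from the question, chunk. an arbitrary output prefix):

split -b 2G theFile chunk.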

Edit: apparently split is not a good option

I've created a file, try.txt, with the following content:

x1 a b
x1 b c
x1 c d
x2 a b
x2 a d
x3 a b
x3 a c
x3 b c
x3 c c

First, I need to know how many lines there are per key:

Linux prompt>awk '{print $1}' try.txt | sort | uniq -c
  3 x1
  2 x2
  4 x3

(Remark: uniq -c prefixes each distinct line with the number of times it occurs)

So "x1" occurs 3 times, "x2" 2 times, and "x3" 4 times. Now let's extract those parts:

Linux prompt>head -n 3 try.txt
  x1 a b
  x1 b c
  x1 c d

Linux prompt>head -n $((3+2)) try.txt | tail -n 2
  x2 a b
  x2 a d

Linux prompt>head -n $((3+2+4)) try.txt | tail -n 4
  x3 a b
  x3 a c
  x3 b c
  x3 c c

It's not an entirely scripted solution, but I guess it might be helpful for you.
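If you want to script those extractions, a minimal sketch along these lines should work, relying on the file being sorted by key (the block.$key output names are made up for illustration):

start=1
awk '{print $1}' try.txt | uniq -c | while read -r count key; do
  # take the $count lines of this key's block, ending at line start+count-1
  head -n $((start + count - 1)) try.txt | tail -n "$count" > "block.$key"
  start=$((start + count))
done

Note this creates one file per key, like the awk one-liner the question rules out; for roughly 10 equal chunks, the split-based approach in the other answer fits better.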

Upvotes: 1
