Reputation: 315
I have a large file (20 GB) consisting of 30 million data records. The first field of each record is a non-unique key. The file is sorted by this key. I'd like to split this file into chunks, using anything available in a bash shell, such that the chunks are approximately the same size in bytes and all records with the same key go into the same chunk in the same order as in the original file.
I obviously do not want awk -F";" '{print > $1}' theFile
because I'd prefer on the order of 10 large chunks, not one file per key. Also, split alone won't cut it, because I need a way to keep identical keys together.
Upvotes: 0
Views: 212
Reputation: 27215
You can pre-process the file such that split knows where it is allowed to split. Here, we insert the null byte \0 to mark that splitting is allowed. Afterwards, we remove all \0 from the generated files. This assumes that your original data never contains \0.
# print a NUL before each line where the key changes (except the very first line)
awk -F\; '$1!=last {last=$1; if(NR>1) printf "\0"} 1' file > file.tmp
# split at NUL boundaries only, into 10 chunks; the filter strips the NULs on output
split -t\\0 --filter 'tr -d \\0 > "$FILE"' -n l/10 file.tmp file.
rm file.tmp
You can adapt split's options to your liking. Here we split into 10 chunks. Due to our pre-processing and the changed delimiter -t\\0, the chunk option l/… keeps identical keys together.
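If you'd rather cap the chunks at a byte size than fix their count, split's -C (--line-bytes) option should combine with the same NUL separator, packing as many whole key groups as fit under the limit into each chunk. An untested sketch, with an illustrative 2G limit:

split -t\\0 --filter 'tr -d \\0 > "$FILE"' -C 2G file.tmp file.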
To verify that everything worked, you can run
for i in file.*; do
echo "--- $i ---"
head -n1 "$i"
echo "[...]"
tail -n1 "$i"
done
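For a stricter check, this one-liner prints every key that ended up in more than one chunk, so empty output means the grouping held (assuming the ; delimiter and sorted input as above):

for i in file.*; do cut -d\; -f1 "$i" | uniq; done | sort | uniq -d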
I generated a test file using
n=1""000""000; paste -d\; <(shuf -i1-500 -rn$n) <(shuf -rn$n /usr/share/dict/words) | sort -t\; -k1,1n > file
and got
--- file.aa ---
1;abbesses
[...]
55;Zoroaster
--- file.ab ---
56;abase
[...]
107;zoologists
--- file.ac ---
108;abattoir
[...]
and so on.
Upvotes: 2
Reputation: 17493
You mention split as a tag, but do you know that the UNIX/Linux command split exists specifically for this purpose?
The man page mentions (among others):
-b, --bytes=SIZE
put SIZE bytes per output file
There are plenty of examples all over the internet.
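A minimal example of that usage (the size is illustrative) would be:

split -b 2G theFile chunk.

Note, however, that -b cuts at exact byte offsets, i.e. in the middle of records.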
Edit: apparently split is not a good option.
I've created a file, try.txt, with the following content:
x1 a b
x1 b c
x1 c d
x2 a b
x2 a d
x3 a b
x3 a c
x3 b c
x3 c c
First, I need to know how many lines there are per key:
Linux prompt>awk '{print $1}' try.txt | sort | uniq -c
3 x1
2 x2
4 x3
(Remark: uniq -c prefixes each distinct entry with its number of occurrences.)
So "x1" occurs 3 times, "x2" twice, and "x3" 4 times. Now let's take those parts:
Linux prompt>head -n 3 try.txt
x1 a b
x1 b c
x1 c d
Linux prompt>head -n $((3+2)) try.txt | tail -n 2
x2 a b
x2 a d
Linux prompt>head -n $((3+2+4)) try.txt | tail -n 4
x3 a b
x3 a c
x3 b c
x3 c c
It's not a fully scripted solution, but I guess it might be helpful for you.
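For completeness, here is a minimal awk sketch that automates the same idea: it streams the file once, tracks the bytes written to the current chunk, and only starts a new output file when the key changes after a size threshold is reached. The output names chunk.NN and the max size are hypothetical; for the ;-delimited file from the question you'd add -F\;.

awk -v max=2000000000 '
  # key changed and current chunk is big enough: start the next chunk
  $1 != prev && bytes >= max { close(out); n++; bytes = 0 }
  {
    out = sprintf("chunk.%02d", n)  # n defaults to 0 for the first chunk
    print > out
    bytes += length($0) + 1         # +1 for the newline
    prev = $1
  }
' try.txt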
Upvotes: 1