Reputation: 315
I have a large file (20 GB) consisting of 30 million data records. The first field of each record is a non-unique key. The file is sorted by this key. I'd like to split this file into chunks, using anything available in a bash shell, such that the chunks are approximately the same size in bytes and all records with the same key go into the same chunk in the same order as in the original file.
I obviously do not want awk -F";" '{print > $1}' theFile
because I'd prefer on the order of 10 large chunks, not one file per key. Also, split alone won't cut it, because I need a way to keep identical keys together.
Upvotes: 0
Views: 212
Reputation: 27215
You can pre-process the file such that split knows where it is allowed to split. Here, we insert the null byte \0 to mark that splitting is allowed. Afterwards, we remove all \0 from the generated files. This assumes that your original data never contains \0.
# print a NUL before each line where the key changes (except the very first line)
awk -F\; '$1!=last {last=$1; if(NR>1) printf "\0"} 1' file > file.tmp
# split at NUL boundaries only, into 10 chunks; the filter strips the NULs on output
split -t\\0 --filter 'tr -d \\0 > "$FILE"' -n l/10 file.tmp file.
rm file.tmp
You can adapt split's options to your liking. Here we split into 10 chunks. Due to our pre-processing and the changed delimiter -t\\0, the chunk option l/… keeps identical keys together.
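If you'd rather cap the chunks at a byte size than fix their count, split's -C (--line-bytes) option should combine with the same NUL separator, packing as many whole key groups as fit under the limit into each chunk. An untested sketch, with an illustrative 2G limit:

split -t\\0 --filter 'tr -d \\0 > "$FILE"' -C 2G file.tmp file.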
To verify that everything worked, you can run
for i in file.*; do
echo "--- $i ---"
head -n1 "$i"
echo "[...]"
tail -n1 "$i"
done
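For a stricter check, this one-liner prints every key that ended up in more than one chunk, so empty output means the grouping held (assuming the ; delimiter and sorted input as above):

for i in file.*; do cut -d\; -f1 "$i" | uniq; done | sort | uniq -d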
I generated a test file using
n=1""000""000; paste -d\; <(shuf -i1-500 -rn$n) <(shuf -rn$n /usr/share/dict/words) | sort -t\; -k1,1n > file
and got
--- file.aa ---
1;abbesses
[...]
55;Zoroaster
--- file.ab ---
56;abase
[...]
107;zoologists
--- file.ac ---
108;abattoir
[...]
and so on.
Upvotes: 2
Reputation: 17493
You mention split as a tag, but do you know that the UNIX/Linux command split exists specifically for this purpose?
The man page mentions (among others):
-b, --bytes=SIZE
put SIZE bytes per output file
There are plenty of examples all over the internet.
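A minimal example of that usage (the size is illustrative) would be:

split -b 2G theFile chunk.

Note, however, that -b cuts at exact byte offsets, i.e. in the middle of records.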
Edit: apparently split is not a good option.
I've created a file, try.txt, with the following content:
x1 a b
x1 b c
x1 c d
x2 a b
x2 a d
x3 a b
x3 a c
x3 b c
x3 c c
First, I need to know how many lines there are per key:
Linux prompt>awk '{print $1}' try.txt | sort | uniq -c
3 x1
2 x2
4 x3
(Remark: uniq -c prefixes each distinct entry with its number of occurrences.)
So "x1" occurs 3 times, "x2" twice, and "x3" 4 times. Now let's take those parts:
Linux prompt>head -n 3 try.txt
x1 a b
x1 b c
x1 c d
Linux prompt>head -n $((3+2)) try.txt | tail -n 2
x2 a b
x2 a d
Linux prompt>head -n $((3+2+4)) try.txt | tail -n 4
x3 a b
x3 a c
x3 b c
x3 c c
It's not a fully scripted solution, but I guess it might be helpful for you.
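For completeness, here is a minimal awk sketch that automates the same idea: it streams the file once, tracks the bytes written to the current chunk, and only starts a new output file when the key changes after a size threshold is reached. The output names chunk.NN and the max size are hypothetical; for the ;-delimited file from the question you'd add -F\;.

awk -v max=2000000000 '
  # key changed and current chunk is big enough: start the next chunk
  $1 != prev && bytes >= max { close(out); n++; bytes = 0 }
  {
    out = sprintf("chunk.%02d", n)  # n defaults to 0 for the first chunk
    print > out
    bytes += length($0) + 1         # +1 for the newline
    prev = $1
  }
' try.txt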
Upvotes: 1