Reputation: 3038
I have a very large CSV file which I want to split into smaller files, so that all the entries in the large file that have the same value for the ID column (the second column in the CSV) end up in the same file. However, I also need 50 different IDs in each of the smaller files.
I have code that does this, but for a 1 GB file it takes around 15-20 minutes. Is there a more efficient way of doing it?
This is what I have right now:
awk -F, '{if(NR > 1) {print >> $2"_backfill_tmp.csv"; close($2"_backfill_tmp.csv")}}' "$input_file"
counter=0
PART_NUMBER=1
for file in *"_backfill_tmp.csv"
do
file_name=${input_file%.*}"_backfill2_part_"$PART_NUMBER".csv"
cat "$file" >> "$file_name"
rm "$file"
(( counter++ ))
if (( $counter % 50 == 0 )) ; then
(( PART_NUMBER++ ))
fi
done
The awk command writes each row into a separate file based on the value of column 2 (ignoring the first row, which is the header), so that all rows with the same ID value end up in the same file. I close the file after each write because otherwise I run into a "Too many open files" error, and I can't raise the ulimit on the machine. This step only takes around 15 seconds, so it is not the concern.
I then loop through the temporary per-ID files and append each one to a combined part file until $counter reaches 50 (i.e. 50 files are combined into one part). This is where most of the time goes: since there are a lot of individual ID files, opening and appending them one by one takes a long time.
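A rough, untested sketch of batching those appends, so that each part file is written by a single cat call per 50 temporary files instead of one cat per file (it reuses the *_backfill_tmp.csv files and the input_file/PART_NUMBER variables from the script above):

# Untested sketch: collect 50 temporary files, then append and remove
# them together, so only one cat/rm pair runs per part file.
counter=0
PART_NUMBER=1
batch=()
for file in *"_backfill_tmp.csv"; do
    batch+=("$file")
    (( counter++ ))
    if (( counter % 50 == 0 )); then
        cat "${batch[@]}" >> "${input_file%.*}_backfill2_part_${PART_NUMBER}.csv"
        rm "${batch[@]}"
        batch=()
        (( PART_NUMBER++ ))
    fi
done
# Flush a last, smaller batch if the number of IDs is not a multiple of 50.
if (( ${#batch[@]} > 0 )); then
    cat "${batch[@]}" >> "${input_file%.*}_backfill2_part_${PART_NUMBER}.csv"
    rm "${batch[@]}"
fi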
I'm an awk beginner, so I'm pretty sure my code isn't efficient. Is there any way I can make the whole process faster?
Upvotes: 2
Views: 1506
Reputation: 27205
You can use the following script. I did not use close, since now the number of open files is only #uniqueIDs / 50 instead of #uniqueIDs.
awk -F, 'NR > 1 {
if (!($2 in mapIdToPart)) {
if (uniqueIds % 50 == 0) {
maxPart++;
}
mapIdToPart[$2] = maxPart;
uniqueIds++;
}
print >> "part"mapIdToPart[$2]".csv";
}' input.csv
This creates files part#.csv, where # is the number of the current part. The input file doesn't have to be sorted. Rows with the same ID will go to the same part. The order of rows in each part corresponds to the order of rows in the input file. Each part has 50 (or fewer, for the last part) unique IDs.
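As a quick sanity check (not part of the script itself), the number of unique IDs per generated part can be counted; every part should report at most 50:

# Count the unique values of column 2 in every generated part file.
for f in part*.csv; do
    printf '%s: %s unique IDs\n' "$f" "$(cut -d, -f2 "$f" | sort -u | wc -l)"
done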
You can speed up the script when your input file is sorted by ID, because then you don't need the mapIdToPart mapping and each generated part is written in one go. The sort order can be alphabetical, numeric, ..., it doesn't matter. Here I assumed that the sorted file no longer has a header; if there is still a header, add NR > 1 at the beginning of the awk script.
awk -F, '{
if ($2 != lastId) {
lastId = $2;
if (uniqueIds % 50 == 0) {
close("part"maxPart".csv");
maxPart++;
}
uniqueIds++;
}
print >> "part"maxPart".csv";
}' sorted.csv
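For completeness, a sketch of the header-keeping variant mentioned above, assuming the header is still the first line of the sorted file (the file name sorted_with_header.csv is just an example):

awk -F, 'NR > 1 {
  if ($2 != lastId) {
    lastId = $2;
    if (uniqueIds % 50 == 0) {
      close("part"maxPart".csv");
      maxPart++;
    }
    uniqueIds++;
  }
  print >> "part"maxPart".csv";
}' sorted_with_header.csv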
To test the script, I generated sample data using
n=98""000""000; paste -d, \
    <(shuf -i 10""000-99""000 -r -n "$n") \
    <(shuf -i 0-9""999 -r -n "$n") \
    | cat <(echo data,id) - > input.csv
The sample data had two columns and 98 million rows of random numbers; there were 10 thousand unique IDs. I measured the first script on the unsorted input, the sorting step

tail -n +2 input.csv | LC_ALL=C sort -t, -k2 > sorted.csv

(the tail part removes the header), and the second script on the sorted file. Conclusion: even when your input is not sorted, it could be faster to sort it first and then run the second script.
Upvotes: 5
Reputation: 67467
You should sort the file for best performance:
$ sort -t, -k2,2 file | awk -F, '!($2 in a){if(c==50) {ix++; c=0}
                                            c++; a[$2]}
                                 {print > ("file_"(ix+1))}'
This counts unique keys and increments the file counter after every 50th new key; each line is printed to the currently indexed file.
You may still need to close files if the number of open files is still above your system's threshold.
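If so, here is a sketch of the same one-liner with an explicit close() when moving on to the next output file; since the input is sorted, at most one output file stays open at a time:

$ sort -t, -k2,2 file | awk -F, '!($2 in a){if(c==50) {close("file_"(ix+1)); ix++; c=0}
                                            c++; a[$2]}
                                 {print > ("file_"(ix+1))}'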
Upvotes: 0