drunkenfist

Reputation: 3038

Efficient way of splitting large file into smaller ones based on column value

I have a very large CSV file which I want to split into smaller files, so that all the entries in the large file that have the same value for the ID column (the second column in the CSV) end up in the same file. However, I also need 50 different IDs in each of the smaller files.

I have code that does this, but for a 1 GB file it takes around 15-20 minutes. Is there a more efficient way of doing it?

This is what I have right now:

awk -F, 'NR > 1 {print >> $2"_backfill_tmp.csv"; close($2"_backfill_tmp.csv")}' "$input_file"
counter=0
PART_NUMBER=1   # index of the merged output part
for file in *"_backfill_tmp.csv"
do
  file_name=${input_file%.*}"_backfill2_part_"$PART_NUMBER".csv"
  cat "$file" >> "$file_name"
  rm "$file"
  (( counter++ ))
  if (( counter % 50 == 0 )) ; then
    (( PART_NUMBER++ ))
  fi
done

The awk command writes each row into a separate file based on the value of column 2 (ignoring the first row, which is a header), so that all rows with the same ID value end up in the same file. I close the file after every write because I otherwise run into a "Too many open files" error, and I can't raise the ulimit on the machine. However, this step only takes around 15 seconds, so it is not a concern.

I then loop through the temporary files and append them to an output file until $counter reaches 50 (i.e. I combine 50 files into one part). This is where most of the time is spent. I'm guessing that, since there are a lot of individual ID files, opening them one by one and merging them takes a long time.

I'm an awk beginner, so I'm pretty sure my code isn't efficient. Is there any way I can make the whole process faster?

Upvotes: 2

Views: 1506

Answers (2)

Socowi

Reputation: 27205

Script For Unsorted Input

You can use the following script. I did not use close, since the number of open files is now only #uniqueIDs / 50 instead of #uniqueIDs.

awk -F, 'NR > 1 {
  if (!($2 in mapIdToPart)) {
    if (uniqueIds % 50 == 0) {
      maxPart++;
    }
    mapIdToPart[$2] = maxPart;
    uniqueIds++;
  }
  print >> "part"mapIdToPart[$2]".csv";
}' input.csv

This creates files part#.csv, where # is the number of the current part. The input file doesn't have to be sorted. Rows with the same ID will go to the same part. The order of rows in each part corresponds to the order of rows in the input file. Each part has 50 (or fewer, for the last part) unique IDs.
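
If you want to sanity-check the output, something along these lines counts the distinct IDs per part; a small sketch, assuming the part#.csv files created by the script above:

# Count the distinct IDs in each generated part; every part should
# report 50, except possibly the last one.
for f in part*.csv; do
  printf '%s: %s unique IDs\n' "$f" "$(cut -d, -f2 "$f" | sort -u | wc -l)"
done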

Script For Sorted Input

You can speed up the script when your input file is sorted by IDs, because then you don't need the mapping mapIdToPart and each generated part is written in one go.

The order can be alphabetical, numeric, and so on; it doesn't matter. Here I assume that the sorted file no longer has a header. If there is still a header, add NR > 1 at the beginning of the awk script.

awk -F, '{
  if ($2 != lastId) {
    lastId = $2;
    if (uniqueIds % 50 == 0) {
      close("part"maxPart".csv");
      maxPart++;
    }
    uniqueIds++;
  }
  print >> "part"maxPart".csv";
}' sorted.csv

Benchmark

To test the script, I generated sample data using

n=98""000""000; paste -d,
    <(shuf -i 10""000-99""000 -r -n "$n") \
    <(shuf -i 0-9""999 -r -n "$n") \
| cat <(echo data,id) - > input.csv

The sample data had two columns of numbers and 98 million rows. There were 10 thousand unique IDs. Measured times were:

  • 3m 54s to run the first script on the unsorted input.
  • 1m 19s to sort the input using tail -n +2 input.csv | LC_ALL=C sort -t, -k2 > sorted.csv. The tail part removes the header.
  • 1m 48s to run the second script on the sorted input.
  • 3m 07s for sorting and then running the second script.

Conclusion: Even when your input is not sorted, it could be faster to sort and then run the second script.
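
If you prefer not to keep sorted.csv around, the two steps can also be chained in a single pipeline; this is simply the sort command from the benchmark feeding the second script via stdin:

# Drop the header, sort by ID, and pipe straight into the second
# script; awk reads from stdin, so no intermediate sorted.csv is written.
tail -n +2 input.csv | LC_ALL=C sort -t, -k2 | awk -F, '{
  if ($2 != lastId) {
    lastId = $2;
    if (uniqueIds % 50 == 0) {
      close("part"maxPart".csv");
      maxPart++;
    }
    uniqueIds++;
  }
  print >> "part"maxPart".csv";
}'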

Upvotes: 5

karakfa

Reputation: 67467

You should sort the file for best performance.

$ sort -t, -k2,2 file | awk -F, '!($2 in a){a[$2];
                                            if(c++ % 50 == 0) ix++}
                                           {print > ("file_" ix)}'

This counts the unique keys and increments the file index every 50 new IDs; each line is printed to the currently indexed file.

You may still need to close files if you are still above your system's open-file threshold.
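
Because the input is sorted, each part is written contiguously, so one option is to close the previous part whenever the index advances. A sketch of that variant, keeping the file_# naming from above:

# Same logic as above, but the previous part is closed as soon as a
# new one starts. Works because the sorted input never revisits a part,
# so at most one output file is open at a time.
sort -t, -k2,2 file | awk -F, '
  !($2 in a) {
    a[$2]
    if (c++ % 50 == 0) {            # every 50th new ID starts a new part
      if (ix) close("file_" ix)     # previous part is finished; free its handle
      ix++
    }
  }
  { print > ("file_" ix) }'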

Upvotes: 0
