Reputation: 835
I have a gzip file of size 81G which, when uncompressed, is 254G. I want to implement a bash script which takes the gzip file and splits it on the basis of the first column. The first column has values ranging between 1 and 10. I want to split the file into 10 subfiles, where all rows whose first-column value is 1 are put into one file, all rows whose first-column value is 2 are put into a second file, and so on. While I do that, I don't want to put column 3 and column 5 in the new subfiles. The file is tab separated. For example:
col_1  col_2  col_3    col_4  col_5   col_6
1      7464   sam      NY     0.738   28.9
1      81932  Dave     NW     0.163   91.9
2      162    Peter    SD     0.7293  673.1
3      7193   Ooni     GH     0.746   6391
3      6139   Jess     GHD    0.8364  81937
3      7291   Yeldish  HD     0.173   1973
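For instance, the subfile holding the rows with 3 in the first column (something like 3.csv.gz) should, once decompressed, contain (with or without the header line):

3      7193   GH     6391
3      6139   GHD    81937
3      7291   HD     1973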
So the file above will result in three different gzipped files, with col_3 and col_5 removed from each of the new subfiles. What I did was:
#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 500G
#SBATCH --time 12:00:00
#SBATCH -c 1
awk -F, '{print > $1".csv.gz"}' file.csv.gz
But this is not producing the desired result. Also, I don't know how to remove col_3 and col_5 from the new subfiles. Like I said, the gzip file is 81G, so I am looking for an efficient solution. Insights will be appreciated.
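I suspect part of the problem is that awk is reading the raw compressed bytes rather than text, since decompressing first shows the expected content:

zcat file.csv.gz | head -n 2    # prints the header and first data row as text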
Upvotes: 1
Views: 1174
Reputation: 203655
Robustly and portably with any awk, if the file is always sorted by the first field as shown in your example:
gunzip -c infile.gz |
awk '
BEGIN { FS=OFS="\t" }                          # the input and output are tab-separated
{ $0 = $1 OFS $2 OFS $4 OFS $6 }               # keep only columns 1, 2, 4, 6
NR==1 { hdr = $0; next }                       # remember the header line
$1 != prev { close(gzip); gzip="gzip > \047" $1 ".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }               # write the header once per output file
{ print | gzip }
'
otherwise:
gunzip -c infile.gz |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3- |
awk '
BEGIN { FS=OFS="\t" }
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047" $1 ".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
The first awk prefixes each line with two fields: a 0/1 flag so that the header line sorts before the rest, and the input line number so that lines with the same original first-field value retain their original input order. Then we sort on the flag, the original first field, and the line number, and cut away the two fields added in the first step.

The final awk robustly and portably creates the separate output files, ensuring that each output file starts with a copy of the header. We close each output pipe as we go so that the script will work for any number of output files using any awk, and will work efficiently even for a large number of output files with GNU awk. It also quotes each output file name (the \047 escapes are single quotes) to avoid globbing, word splitting, and filename expansion in the shell command that awk runs.
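To make the decorate/sort/undecorate steps concrete, here is the first half of that pipeline run on a two-row shuffled sample (rows taken from your example):

$ printf 'col_1\tcol_2\tcol_3\tcol_4\tcol_5\tcol_6\n3\t7193\tOoni\tGH\t0.746\t6391\n1\t7464\tsam\tNY\t0.738\t28.9\n' |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3-
col_1   col_2   col_3   col_4   col_5   col_6
1       7464    sam     NY      0.738   28.9
3       7193    Ooni    GH      0.746   6391

The header stays first because its added flag is 0, and the data rows come out grouped by the original first column.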
Upvotes: 3
Reputation: 52419
Something like
zcat input.csv.gz | cut -f1,2,4,6- | awk '{ print | ("gzip -c > " $1 ".csv.gz") }'
Uncompress the file, remove fields 3 and 5, save to the appropriate compressed file based on the first column.
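A quick way to sanity-check the cut field list on one of the sample rows:

$ printf '1\t7464\tsam\tNY\t0.738\t28.9\n' | cut -f1,2,4,6-
1       7464    NY      28.9

Fields 3 (sam) and 5 (0.738) are gone; everything from field 6 onward is kept.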
Upvotes: 3
Reputation: 52152
You have to decompress and recompress; to get rid of columns 3 and 5, you could use GNU cut like this:
gunzip -c infile.gz \
| cut --complement -f3,5 \
| awk '{ print | "gzip > " $1 "csv.gz" }'
Or you could get rid of the columns in awk:
gunzip -c infile.gz \
| awk -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 ".csv.gz" }'
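And a minimal end-to-end check on two of the sample rows (run it in a scratch directory, since it creates 1.csv.gz and 2.csv.gz):

printf '1\t7464\tsam\tNY\t0.738\t28.9\n2\t162\tPeter\tSD\t0.7293\t673.1\n' \
| awk -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 ".csv.gz" }'
gunzip -c 1.csv.gz    # 1       7464    NY      28.9
gunzip -c 2.csv.gz    # 2       162     SD      673.1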
Upvotes: 3