Reputation: 835
I have a gzip file of size 81G which, when uncompressed, is 254G. I want to implement a bash script which takes the gzip file and splits it on the basis of the first column. The first column has values ranging between 1 and 10. I want to split the file into 10 subfiles, where all rows whose first-column value is 1 are put into one file, all rows whose first-column value is 2 are put into a second file, and so on. While I do that, I don't want to put column 3 and column 5 in the new subfiles. The file is tab separated. For example:
col_1  col_2  col_3    col_4  col_5   col_6
1      7464   sam      NY     0.738   28.9
1      81932  Dave     NW     0.163   91.9
2      162    Peter    SD     0.7293  673.1
3      7193   Ooni     GH     0.746   6391
3      6139   Jess     GHD    0.8364  81937
3      7291   Yeldish  HD     0.173   1973
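For instance, the subfile holding the rows with 3 in the first column (something like 3.csv.gz) should, once decompressed, contain (with or without the header line):

3      7193   GH     6391
3      6139   GHD    81937
3      7291   HD     1973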
So the file above will result in three different gzipped files, with col_3 and col_5 removed from each of the new subfiles. What I did was:
#!/bin/bash
#SBATCH --partition normal
#SBATCH --mem-per-cpu 500G
#SBATCH --time 12:00:00
#SBATCH -c 1
awk -F, '{print > $1".csv.gz"}' file.csv.gz
But this is not producing the desired result. Also, I don't know how to remove col_3 and col_5 from the new subfiles. Like I said, the gzip file is 81G, so I am looking for an efficient solution. Insights will be appreciated.
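I suspect part of the problem is that awk is reading the raw compressed bytes rather than text, since decompressing first shows the expected content:

zcat file.csv.gz | head -n 2    # prints the header and first data row as text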
Upvotes: 1
Views: 1174
Reputation: 203655
Robustly and portably with any awk, if the file is always sorted by the first field as shown in your example:
gunzip -c infile.gz |
awk '
BEGIN { FS=OFS="\t" }                          # the input and output are tab-separated
{ $0 = $1 OFS $2 OFS $4 OFS $6 }               # keep only columns 1, 2, 4, 6
NR==1 { hdr = $0; next }                       # remember the header line
$1 != prev { close(gzip); gzip="gzip > \047" $1 ".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }               # write the header once per output file
{ print | gzip }
'
otherwise:
gunzip -c infile.gz |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3- |
awk '
BEGIN { FS=OFS="\t" }
{ $0 = $1 OFS $2 OFS $4 OFS $6 }
NR==1 { hdr = $0; next }
$1 != prev { close(gzip); gzip="gzip > \047" $1 ".csv.gz\047"; prev=$1 }
!seen[$1]++ { print hdr | gzip }
{ print | gzip }
'
The first awk prefixes each line with two fields: a 0/1 flag so that the header line sorts before the rest, and the input line number so that lines with the same original first-field value retain their original input order. Then we sort on the flag, the original first field, and the line number, and cut away the two fields added in the first step.

The final awk robustly and portably creates the separate output files, ensuring that each output file starts with a copy of the header. We close each output pipe as we go so that the script will work for any number of output files using any awk, and will work efficiently even for a large number of output files with GNU awk. It also quotes each output file name (the \047 escapes are single quotes) to avoid globbing, word splitting, and filename expansion in the shell command that awk runs.
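To make the decorate/sort/undecorate steps concrete, here is the first half of that pipeline run on a two-row shuffled sample (rows taken from your example):

$ printf 'col_1\tcol_2\tcol_3\tcol_4\tcol_5\tcol_6\n3\t7193\tOoni\tGH\t0.746\t6391\n1\t7464\tsam\tNY\t0.738\t28.9\n' |
awk 'BEGIN{FS=OFS="\t"} {print (NR>1), NR, $0}' |
sort -k1,1n -k3,3 -k2,2n |
cut -f3-
col_1   col_2   col_3   col_4   col_5   col_6
1       7464    sam     NY      0.738   28.9
3       7193    Ooni    GH      0.746   6391

The header stays first because its added flag is 0, and the data rows come out grouped by the original first column.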
Upvotes: 3
Reputation: 52419
Something like
zcat input.csv.gz | cut -f1,2,4,6- | awk '{ print | ("gzip -c > " $1 ".csv.gz") }'
Uncompress the file, remove fields 3 and 5, save to the appropriate compressed file based on the first column.
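A quick way to sanity-check the cut field list on one of the sample rows:

$ printf '1\t7464\tsam\tNY\t0.738\t28.9\n' | cut -f1,2,4,6-
1       7464    NY      28.9

Fields 3 (sam) and 5 (0.738) are gone; everything from field 6 onward is kept.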
Upvotes: 3
Reputation: 52152
You have to decompress and recompress; to get rid of columns 3 and 5, you could use GNU cut like this:
gunzip -c infile.gz \
| cut --complement -f3,5 \
| awk '{ print | "gzip > " $1 "csv.gz" }'
Or you could get rid of the columns in awk:
gunzip -c infile.gz \
| awk -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 ".csv.gz" }'
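And a minimal end-to-end check on two of the sample rows (run it in a scratch directory, since it creates 1.csv.gz and 2.csv.gz):

printf '1\t7464\tsam\tNY\t0.738\t28.9\n2\t162\tPeter\tSD\t0.7293\t673.1\n' \
| awk -v OFS='\t' '{ print $1, $2, $4, $6 | "gzip > " $1 ".csv.gz" }'
gunzip -c 1.csv.gz    # 1       7464    NY      28.9
gunzip -c 2.csv.gz    # 2       162     SD      673.1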
Upvotes: 3