Jan
Jan

Reputation: 248

AWK Split large CSV file with headers and print output files based on column value

I have a CSV file of around 800 mb which I need to split up using AWK. The file has a column with ID's in them which I want to use to split the file on. I'm familiar/know how to accomplish this with Perl but not with AWK since I've only used it a few times. (In perl I would use the Text::CSV module but I don't have the option in this case)

I found this answer: https://stackoverflow.com/a/16795137 which is basically what I want but with a small alteration. It has to contain an if statement so it will only print if the column I want to split it on is a digit. This is necessary because the file column can shift sometimes and I want to send the non-digit lines to a seperate file (junk.csv).

I'm using the windows cmd version for testing right now but I'll eventually run it on linux. (Below original code)

awk -F, "NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr\"\n\">$3\".csv\"}{print>$3\".csv\"}" test.csv

And my intention is this:

awk -F";" "{if ($3 ~ /^[0-9]+$/){"NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3\".csv\"}{print>$3\".csv\"}"" test.csv

I can't figure out how to do this in AWK (just yet). The double quotes are also throwing me off (because of the windows version). Where am I going wrong?

This is my error output:

awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk:                           ^ syntax error
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk:                                                      ^ syntax error
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk:                                                                                                     ^ syntax error
awk: {if($3 ~ /^[0-9]+$/) NR==1{hdr=$0;next}!($3 in files){files[$3]=1;print hdr>$3".csv"}{print>$3.csv};else print>junk.csv}
awk:                                                                                                                    ^ syntax error
errcount: 4

This is my (sample) data:

10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10004766;12.99;48;http://testdata.com/bla/29007085.jpg;5.95;95074666117
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10009363;119.0;53;http://testdata.com/bla/29004907.jpg;5.95;9823036360
10009631;19.95;48;http://testdata.com/bla/29013097.jpg;5.95;20689058198
10010119;9.99;48;http://testdata.com/bla/29016592.jpg;5.95;80076014280
10012615;20.99;53;http://testdata.com/bla/28772382.jpg;5.95;3948187983
10015250;14.99;48;http://testdata.com/bla/29015812.jpg;5.95;93962045440
10019190;69.99;53;http://testdata.com/bla/29010968.jpg;5.95;948187983
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10034957;34.99;53;http://testdata.com/bla/29000529.jpg;5.95;42872898825
10041967;24.99;65;http://testdata.com/bla/28781700.jpg;5.95;91229911080
10045277;59.99;65;http://testdata.com/bla/29010583.jpg;5.95;67365082290
10045795;10.99;48;http://testdata.com/bla/29002819.jpg;5.95;19422308188
10048375;26.99;26;http://testdata.com/bla/29002270.jpg;5.95;95082912275
10052550;19.99;48;http://testdata.com/bla/29016347.jpg;5.95;7368425436

And I want to accomplish this:

File --> 26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10048375;26.99;26;http://testdata.com/bla/29002270.jpg;5.95;95082912275

File --> 48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10004766;12.99;48;http://testdata.com/bla/29007085.jpg;5.95;95074666117
10009631;19.95;48;http://testdata.com/bla/29013097.jpg;5.95;20689058198
10010119;9.99;48;http://testdata.com/bla/29016592.jpg;5.95;80076014280
10015250;14.99;48;http://testdata.com/bla/29015812.jpg;5.95;93962045440
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10045795;10.99;48;http://testdata.com/bla/29002819.jpg;5.95;19422308188
10052550;19.99;48;http://testdata.com/bla/29016347.jpg;5.95;7368425436

File --> 53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10009363;119.0;53;http://testdata.com/bla/29004907.jpg;5.95;9823036360
10012615;20.99;53;http://testdata.com/bla/28772382.jpg;5.95;3948187983
10019190;69.99;53;http://testdata.com/bla/29010968.jpg;5.95;948187983
10034957;34.99;53;http://testdata.com/bla/29000529.jpg;5.95;42872898825

File --> 65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
10041967;24.99;65;http://testdata.com/bla/28781700.jpg;5.95;91229911080
10045277;59.99;65;http://testdata.com/bla/29010583.jpg;5.95;67365082290

Upvotes: 3

Views: 2594

Answers (1)

nu11p01n73R
nu11p01n73R

Reputation: 26667

You can simplify the awk as

awk -F\; '{print > $3".csv"}' input

Will produce the following csv files with content

26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10048375;26.99;26;http://testdata.com/bla/29002270.jpg;5.95;95082912275
48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10004766;12.99;48;http://testdata.com/bla/29007085.jpg;5.95;95074666117
10009631;19.95;48;http://testdata.com/bla/29013097.jpg;5.95;20689058198
10010119;9.99;48;http://testdata.com/bla/29016592.jpg;5.95;80076014280
10015250;14.99;48;http://testdata.com/bla/29015812.jpg;5.95;93962045440
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10045795;10.99;48;http://testdata.com/bla/29002819.jpg;5.95;19422308188
10052550;19.99;48;http://testdata.com/bla/29016347.jpg;5.95;7368425436
53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10009363;119.0;53;http://testdata.com/bla/29004907.jpg;5.95;9823036360
10012615;20.99;53;http://testdata.com/bla/28772382.jpg;5.95;3948187983
10019190;69.99;53;http://testdata.com/bla/29010968.jpg;5.95;948187983
10034957;34.99;53;http://testdata.com/bla/29000529.jpg;5.95;42872898825
65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
10041967;24.99;65;http://testdata.com/bla/28781700.jpg;5.95;91229911080
10045277;59.99;65;http://testdata.com/bla/29010583.jpg;5.95;67365082290

NOTE

If you want to send the lines which have non digits in column 3 to junk.csv a small change in the above awk can be helpfull

awk -F\; '$3 ~ /^[0-9]+$/{print > $3".csv"; next} {print > "junk.csv"}' input
  • $3 ~ /^[0-9]+$/ performs a regex match on column 3 and if it matches, sends to corresponding csv file. else the line is written to junk.csv

OR

a much simpler version like

awk -F\; '{file=$3~/^[0-9]+$/?$3:"junk";print >file".csv"}'

Thanks to Jidder for the suggestion.

Upvotes: 3

Related Questions