awk: Split file using filenames different to the field

Question

I have a very large CSV file, input.csv, that looks like this:

https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.66, 0.7, 89
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.56, 0.98, 87
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.66, 0.7, 89
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.56, 0.98, 87
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.66, 0.7, 89
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.56, 0.98, 87
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.66, 0.7, 89

I am trying to save the contents (all the columns) of this file based on the URL in the first column into separate files.

So the output for the above snippet should be two files:

https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.66, 0.7, 89
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.56, 0.98, 87
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.66, 0.7, 89

and

https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.56, 0.98, 87
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.66, 0.7, 89
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.56, 0.98, 87
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.66, 0.7, 89

To split this file based on the first column, I am using awk thus:

awk -F, '{print >> ($1".csv")}' input.csv

However, I am unable to save to any file based on the URL field because of this error:

awk: cmd. line:1: (FILENAME=input.csv FNR=1) fatal: can't redirect to `    https://www.youtube.com/watch?v=9t5V_sMVN5I.csv' (No such file or directory)

Using the URL-style string as the filename is apparently what causes the error. The '/' characters in it are presumably being treated as path separators, so awk tries to write into directories that do not exist.

Is there any way to save the contents based on column 1 ($1) using awk, but such that the output files are named differently, perhaps following a numbered sequence 1..N? The other option is to replace every URL with some unique identifier and then split on that; however, I have not yet been able to script this.

Any help would be appreciated!

Sundeep · Accepted Answer

Since the first column has a regular format, with the string after = serving as a unique identifier, we can use that string as the filename:

awk -F, '{split($1,a,"="); print > (a[2]".csv")}' input.csv

$ cat b7kKTSVbfdA.csv
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.56, 0.98, 87
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.66, 0.7, 89
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.56, 0.98, 87
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.66, 0.7, 89

$ cat 9t5V_sMVN5I.csv
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.66, 0.7, 89
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.56, 0.98, 87
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.66, 0.7, 89
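
If the first column did not contain a filename-safe identifier, the numbered 1..N naming suggested in the question could be sketched roughly as below. This is only an illustration under some assumptions: the array id, the variables n, prev and out, and the lookup file index.csv are hypothetical names chosen here, and the sample input is a shortened copy of the question's data. Each new URL gets the next number, the mapping from number to URL is written to index.csv, and the previous output file is closed whenever the URL changes so that a large number of distinct URLs should not exhaust open file descriptors.

```shell
# Work in a scratch directory so no existing files are appended to or overwritten.
cd "$(mktemp -d)"

# Sample input in the same shape as the input.csv from the question.
cat > input.csv <<'EOF'
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.66, 0.7, 89
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.56, 0.98, 87
https://www.youtube.com/watch?v=b7kKTSVbfdA, 0.56, 0.98, 87
https://www.youtube.com/watch?v=9t5V_sMVN5I, 0.66, 0.7, 89
EOF

awk -F, '
# First time this URL is seen: assign the next number and record the mapping.
!($1 in id) { id[$1] = ++n; print n "," $1 > "index.csv" }

# URL changed (or first line): close the previous file and switch to the new one.
NR == 1 || $1 != prev {
    if (out != "") close(out)
    out = id[$1] ".csv"
    prev = $1
}

# Append, because a file may be reopened if its URL shows up again later.
{ print >> out }
' input.csv
```

Because the rows are appended with >>, running this twice in the same directory would duplicate the rows, so start from a directory without old 1.csv, 2.csv, ... files. The file index.csv then tells you which URL each numbered file belongs to.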
