Arindrew

Reputation: 167

Use awk to output each line from a file to a new filename based on specific separators

I have the following file with tabs as the field separators:

header1 header2 header3 header4 header5
1field1 1field2 1field3 1field4 1field5
2field1 2field2 2field3 2field4 2field5
3field1 3field2 3field3 3field4 3field5
4field1 4field2 4field3 4field4 4field5

and would like to output each line to a new file (skipping the first line). Each new file should be named from the 1st and 5th fields joined by an underscore. For example, the file for the first data line (technically line 2) would be named "1field1_1field5.txt" and contain all the fields from that line, and so on. I have the following awk command, which prints the correct filenames to standard output:

awk -v FS='\t' -v OFS='_' 'NR>1 {print ($1,$5 ".txt") }'

but when I try to output the text into filenames instead

awk -v FS='\t' -v OFS='_' 'NR>1 {print > ($1,$5 ".txt") }'

I get the following error

awk: cmd. line:1: NR>1 { print > ($1,$5 ".txt") }
awk: cmd. line:1:                               ^ syntax error

I have copied/pasted from 10 different other articles to get this far, but I'm stuck on how my formatting is wrong.

Upvotes: 3

Views: 77

Answers (3)

Daweo

Reputation: 36700

In this case

awk -v FS='\t' -v OFS='_' 'NR>1 {print ($1,$5 ".txt") }'

you are giving the print statement 2 arguments, $1 and the concatenation of $5 and .txt, which print joins with OFS. Whereas in

awk -v FS='\t' -v OFS='_' 'NR>1 {print > ($1,$5 ".txt") }'

there is no function to which the arguments could be passed, so the comma is a syntax error. You might use the sprintf string function, which formats and returns a string, in the following way

awk -v FS='\t' -v OFS='_' 'NR>1 {print > sprintf("%s%s%s.txt",$1,OFS,$5) }'
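To try this end to end, here is a minimal sketch. The file name input.tsv, the scratch directory, and the sample rows are assumptions made for illustration. The sprintf call is also wrapped in parentheses, which is not in the command above; it is a cautious choice in case an awk implementation parses the unparenthesized redirection target differently.

```shell
# Work in a scratch directory so the generated files are easy to inspect.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample input: a header plus two tab-separated rows.
printf 'header1\theader2\theader3\theader4\theader5\n'  > input.tsv
printf '1field1\t1field2\t1field3\t1field4\t1field5\n' >> input.tsv
printf '2field1\t2field2\t2field3\t2field4\t2field5\n' >> input.tsv

# Build each output name with sprintf and redirect the whole line to it.
# print with no arguments writes $0 unchanged, so the tabs are preserved.
awk -v FS='\t' -v OFS='_' 'NR>1 { print > (sprintf("%s%s%s.txt", $1, OFS, $5)) }' input.tsv

# List what was created.
ls
```

Each data row should end up in its own file, such as 1field1_1field5.txt, with the header row skipped.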

Upvotes: 3

Ed Morton

Reputation: 204381

Using any awk, you should do the following if your $1 and $5 fields are unique per row:

awk -F '\t' 'NR>1 { out=$1 "_" $5 ".txt"; print > out; close(out) }'

and this otherwise:

awk -F '\t' 'NR>1 { out=$1 "_" $5 ".txt"; if (!seen[out]++) printf "" > out; print >> out; close(out) }'

The close() is so you don't end up with a "too many open files" error if your input is large. The printf "" > out is to empty/init the output file in case it already existed before your script ran.
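The following sketch exercises the non-unique case. The file name input.tsv, the sample rows, and the leftover a_b.txt file are all assumptions made for illustration; the awk body is the second command above, spread over several lines with comments.

```shell
# Work in a scratch directory so the generated files are easy to inspect.
cd "$(mktemp -d)" || exit 1

# Hypothetical input: the first and third data rows share the same $1 and $5.
printf 'h1\th2\th3\th4\th5\n'  > input.tsv
printf 'a\tx1\tx2\tx3\tb\n'   >> input.tsv
printf 'c\ty1\ty2\ty3\td\n'   >> input.tsv
printf 'a\tz1\tz2\tz3\tb\n'   >> input.tsv

# Simulate leftover output from an earlier run.
echo 'stale' > a_b.txt

awk -F '\t' 'NR>1 {
    out = $1 "_" $5 ".txt"
    if (!seen[out]++) printf "" > out   # first time this name is seen: empty the file
    print >> out                        # append the current line
    close(out)                          # release the file so few files stay open
}' input.tsv

# Show the file that received two rows.
cat a_b.txt
```

After the run, a_b.txt should hold both rows whose first and fifth fields are a and b, and the stale line should be gone.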

With GNU awk you could get away without the close():

awk -F '\t' 'NR>1 { print > ($1 "_" $5 ".txt") }'

but the script will slow down significantly for large input as it tries to internally handle opening/closing all of the output files as-needed.

Upvotes: 4

Barmar

Reputation: 782158

The expression ($1,$5 ".txt") is not valid.

You may be thinking that the comma operator concatenates its arguments using OFS as the separator. But that's not how OFS works. It's used as the separator when you give multiple arguments to the print command, but isn't used in expressions.

In expressions, the only concatenation operator is putting sub-expressions next to each other. If you want to concatenate with OFS you must write it explicitly.

awk -v FS='\t' -v OFS='_' 'NR>1 {print > ($1 OFS $5 ".txt") }'

You could also just write the literal instead of using OFS.

awk -v FS='\t' -v OFS='_' 'NR>1 {print > ($1 "_" $5 ".txt") }'
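To see the difference between print arguments and expression concatenation side by side, here is a small sketch. The single sample record and the variable names result and name are assumptions made for illustration.

```shell
# Feed one hypothetical tab-separated record to awk and capture the output.
result=$(printf 'a\tb\tc\td\te\n' |
awk -v FS='\t' -v OFS='_' '{
    print $1, $5 ".txt"        # print statement: its arguments are joined with OFS
    name = $1 OFS $5 ".txt"    # expression: OFS has to be concatenated explicitly
    print name
}')

printf '%s\n' "$result"
```

Both lines of output should read a_e.txt: the first gets its underscore from print joining its two arguments, the second from the explicit OFS in the concatenation.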

Upvotes: 4
