user3362840

Reputation: 151

Why do I get weird output in printf in awk for $0?

The input is as follows:

Title: Aoo Boo

Author: First Last

I am trying to output

Aoo Boo, First Last, "

by using awk like this

awk 'BEGIN { FS="[:[:space:]]+" }
/Title/ { sub(/^Title: /,""); t = $0; } # save title
/Author/{ sub(/^Author: /,""); printf "%s,%s,\"\n", t, $0} 
' t.txt

But the output looks like ,"irst Last. Basically, each piece is printed starting from the beginning of the line, over what was already there.

But if I change $0 to $2, the output is as expected: Boo,Last,"

Why is the output with $0 incorrect? What is the right way to do this?

Upvotes: 1

Views: 1121

Answers (2)

ghoti

Reputation: 46856

This assumes there are no colons in titles or names...

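awk -F': *' '
  $1=="Title" {
    sub(/[^[:print:]]/,"");      # strip a non-printable character such as \r
    t=$2;                        # save the title
  }
  $1=="Author" {
    sub(/[^[:print:]]/,"");      # same cleanup for the author line
    printf("%s, %s\n", t, $2);   # print title and author together
  }
' inputfile.txt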

This works by finding the title and storing it in a variable, then finding the author and using that as a trigger to print everything according to your format. You can alter the format as you see fit.

It may break if there are extra colons on the line, as the colon is being used to split fields. It may also break if your input doesn't match your example.

Perhaps the most important thing in this example is the sub(...) calls, which strip off non-printable characters like the carriage return that rici noticed you have. The regular expression [^[:print:]] matches any character that is not printable, and the carriage return is one of those. Each sub() call removes the first such character on the line if there is one, and does no harm if there is none.
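As a variation on the script above, here is a minimal sketch that removes only a trailing carriage return and then keeps the question's own sub()-based approach. The file name crlf_sample.txt and the sample data are assumptions made up for the illustration, and the output format follows what the question asked for.

```shell
# Build a two-line sample with Windows (CRLF) line endings.
# crlf_sample.txt is a hypothetical file name used only for this sketch.
printf 'Title: Aoo Boo\r\nAuthor: First Last\r\n' > crlf_sample.txt

result=$(awk '
  { sub(/\r$/, "") }                                   # drop a trailing carriage return, if any
  /^Title: /  { sub(/^Title: /, "");  t = $0 }         # save the title
  /^Author: / { sub(/^Author: /, ""); printf "%s, %s, \"\n", t, $0 }
' crlf_sample.txt)

printf '%s\n' "$result"
```

Because the carriage return is removed before anything is stored or printed, both t and $0 are clean, and this should print Aoo Boo, First Last, " on a single line.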

Upvotes: 0

rici

Reputation: 241791

You need to get rid of the Windows line endings in your text file if you want to use Unix utilities.

If you're lucky, you'll find you have the dos2unix program installed, and you'll only need to do this:

dos2unix t.txt

If not, you could do it with tr:

tr -d '\r' < t.txt > new_t.txt

For reference, what is going on is that Windows files have \r\n at the end of every line (that is, a CR control code followed by an NL control code). On Linux, lines end with just the \n, so the \r is part of the data; when you print it, the terminal interprets it as a "carriage return", which moves the cursor to the beginning of the current line rather than advancing to the next line. Since the value of t ends with a \r, the text printed after it overwrites the value of t on screen.
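If you are not sure whether a file has this problem, one rough way to check is to count the lines that end in a carriage return. This is only a sketch: crlf_check.txt is a hypothetical file created here so the example is self-contained, and you would point awk at your own file instead.

```shell
# Hypothetical sample file with CRLF line endings, created for the demonstration.
printf 'Title: Aoo Boo\r\nAuthor: First Last\r\n' > crlf_check.txt

# Count the lines whose last character is a carriage return.
cr_lines=$(awk '/\r$/ { n++ } END { print n + 0 }' crlf_check.txt)

printf 'lines ending in CR: %s\n' "$cr_lines"
```

For the two-line sample this reports 2; a file with Unix line endings would report 0.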

It works with $2 because you've reassigned FS to include [:space:]; that definition of field separators is more generous than the awk default, since it includes \r and \f, neither of which is a default field separator. Consequently, $2 does not contain the \r, but $0 does.
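To see the difference between the two field separators, you can compare the length of the last word under each. The sketch below assumes an awk that understands POSIX character classes such as [:space:] (gawk does); crlf_line.txt is a hypothetical one-line sample.

```shell
# Hypothetical one-line sample with a CRLF line ending.
printf 'Author: First Last\r\n' > crlf_line.txt

# Default FS splits on spaces, tabs and newlines only, so $3 is "Last" plus the \r.
default_len=$(awk '{ print length($3) }' crlf_line.txt)

# With [:space:] in FS, the \r is itself a separator, so $3 is just "Last".
custom_len=$(awk 'BEGIN { FS = "[:[:space:]]+" } { print length($3) }' crlf_line.txt)

printf 'default FS: %s, custom FS: %s\n' "$default_len" "$custom_len"
```

With gawk this should print default FS: 5, custom FS: 4. The extra character in the first case is the carriage return.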

Upvotes: 3
