awk: why does length() report a different value than wc -c?

Question

I don't understand this. It's funny, but I really can't figure it out.

Please see the examples below.

echo -n '\prj\prj.prjjmbr.Interp\PRIL_35.jpg' | awk -F ';' '{a=length($1);print a}'

Output is 35. It's right

echo -n '\prj\prj.prjjmbr.Interp\PRIL_35.jpg' | wc -c

Output is 35. It's right too

echo -n '\prj\prj.prjjmbr.Interp\Very long path with cyrillic symbols\полученные данные_по проект\отчеты\Отчет \Dinam_interp_2D_yujo-vost_ch_Urabor-Yahinskij_LU_2008 ( GNPTs_PurGeo ) \Otchet\GrafPril\PRIL_35.jpg' | awk -F ';' '{print length ($1)}'

Output is 202.

echo -n '\prj\prj.prjjmbr.Interp\Very long path with cyrillic symbols\полученные данные_по проект\отчеты\Отчет \Dinam_interp_2D_yujo-vost_ch_Urabor-Yahinskij_LU_2008 ( GNPTs_PurGeo ) \Otchet\GrafPril\PRIL_35.jpg' | wc -c

Output is 237.

Why do I get different results when the string contains non-Latin characters? How can I fix it?

P.S. After fixing this, I need to use the substr function, e.g. substr(path, 10, 8).

Peter Lundgren · Accepted Answer

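You are getting different results with non-Latin characters because the number of characters in a string is not the same as the number of bytes in it. wc -c returns the number of bytes; awk returns the number of characters. Assuming your terminal uses UTF-8, each Cyrillic letter is stored as 2 bytes while each ASCII character is 1 byte. Your long path contains 35 Cyrillic letters, which accounts exactly for the difference: 237 - 202 = 35.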

Make sure you use the right number. If you need to store the string, you need to know the number of bytes. If you need to display a string, you may be more interested in the number of characters.
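A minimal sketch of the two counts side by side. The sample string is hypothetical, and the script assumes it is saved as UTF-8. The character counts depend on the current locale and on which awk is installed (gawk 3.1.5 or later in a UTF-8 locale counts characters; some other awk implementations count bytes):

```shell
#!/bin/sh
# Hypothetical sample: 5 Cyrillic letters (2 bytes each in UTF-8) plus ".jpg".
s='Отчет.jpg'

# wc -c counts bytes; $(( )) strips the padding some wc versions print.
bytes_wc=$(( $(printf '%s' "$s" | wc -c) ))

# wc -m counts characters according to the current locale.
chars_wc=$(( $(printf '%s' "$s" | wc -m) ))

# In the C locale awk treats every byte as one character.
bytes_awk=$(printf '%s\n' "$s" | LC_ALL=C awk '{ print length($0) }')

# In the current locale a multibyte-aware awk counts characters.
chars_awk=$(printf '%s\n' "$s" | awk '{ print length($0) }')

echo "bytes (wc -c):            $bytes_wc"
echo "bytes (LC_ALL=C awk):     $bytes_awk"
echo "chars (wc -m):            $chars_wc"
echo "chars (awk, this locale): $chars_awk"
```

Both byte counts are 14 (5 × 2 + 4). With gawk in a UTF-8 locale, the character counts should come out as 9. If you need the byte length inside awk, forcing LC_ALL=C for that one command is one way to get it.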

From man wc:

-c, --bytes print the byte counts

From man awk:

As of version 3.1.5, gawk is multibyte aware. This means that index(), length(), substr() and match() all work in terms of characters, not bytes.
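For the substr part of the question, here is a rough sketch of how the offsets differ between the two modes. The path is a shortened, hypothetical one and the script assumes it is saved as UTF-8. The first call only gives the file name part when awk is multibyte aware and the locale is UTF-8:

```shell
#!/bin/sh
# Hypothetical short path: 5 Cyrillic letters, a backslash, then the file name.
s='Отчет\PRIL_35.jpg'

# Character offsets: "PRIL_35" starts at character 7
# (5 letters + 1 backslash come before it).
by_chars=$(printf '%s\n' "$s" | awk '{ print substr($0, 7, 7) }')

# Byte offsets: in the C locale the same text starts at byte 12
# (10 bytes of Cyrillic + 1 backslash come before it).
by_bytes=$(printf '%s\n' "$s" | LC_ALL=C awk '{ print substr($0, 12, 7) }')

echo "by characters: $by_chars"
echo "by bytes:      $by_bytes"
```

The byte-based call prints PRIL_35 regardless of locale. So substr(path, 10, 8) should already select characters 10 through 17 when gawk runs in a UTF-8 locale; byte offsets are only needed if you deliberately switch to the C locale.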
