Reputation: 1299
I don't understand this. It's funny, but I really don't get it.
Please see below.
echo -n '\\prj\prj.prjjmbr.Interp\PRIL_35.jpg' | awk -F ';' '{a=length($1); print a}'
The output is 35, which is correct.
echo -n '\\prj\prj.prjjmbr.Interp\PRIL_35.jpg' | wc -c
The output is 35, which is also correct.
echo -n '\\prj\prj.prjjmbr.Interp\Very long path with cyrillic symbols\полученные данные_по проект\отчеты\Отчет \Dinam_interp_2D_yujo-vost_ch_Urabor-Yahinskij_LU_2008 ( GNPTs_PurGeo ) \Otchet\GrafPril\PRIL_35.jpg' | awk -F ';' '{print length ($1)}'
Output is 202.
echo -n '\\prj\prj.prjjmbr.Interp\Very long path with cyrillic symbols\полученные данные_по проект\отчеты\Отчет \Dinam_interp_2D_yujo-vost_ch_Urabor-Yahinskij_LU_2008 ( GNPTs_PurGeo ) \Otchet\GrafPril\PRIL_35.jpg' | wc -c
Output is 237.
Why do I get different results with non-Latin characters? How can I fix it?
P.S. After the fix, I need to use the substr function, e.g. substr(path, 10, 8);
Upvotes: 1
Views: 1762
Reputation: 9217
You are getting different results with non-Latin characters because the number of characters in a string is not the same as the number of bytes in it. wc -c returns the number of bytes; awk returns the number of characters.
Make sure you use the right number. If you need to store the string, you need to know the number of bytes. If you need to display a string, you may be more interested in the number of characters.
From man wc:
-c, --bytes print the byte counts
From man awk:
As of version 3.1.5, gawk is multibyte aware. This means that index(), length(), substr() and match() all work in terms of characters, not bytes.
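To see the difference on a small string, something like the following might help. It is only a sketch: the sample word is arbitrary, the file is assumed to be saved as UTF-8, and the character count from wc -m depends on the locale in effect, so it only differs from the byte count when a UTF-8 locale is active.

```shell
# Sample string: 6 Cyrillic letters, each encoded as 2 bytes in UTF-8.
s='отчеты'

# wc -c always counts bytes (tr strips the padding some wc versions add).
bytes=$(printf '%s' "$s" | wc -c | tr -d ' ')

# wc -m counts characters according to the current locale;
# in the C locale it falls back to counting bytes.
chars=$(printf '%s' "$s" | wc -m | tr -d ' ')

echo "bytes=$bytes chars=$chars"
```

If both numbers come out the same, the shell is probably not running in a UTF-8 locale.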
Upvotes: 11
Reputation: 15238
I could reproduce your finding and assumed it was locale-related. This is not a fix but a workaround: running awk in the C locale makes length() count bytes, so the result matches wc -c.
echo -n '\\prj\prj.prjjmbr.Interp\Very long path with cyrillic symbols\полученные данные_по проект\отчеты\Отчет \Dinam_interp_2D_yujo-vost_ch_Urabor-Yahinskij_LU_2008 ( GNPTs_PurGeo ) \Otchet\GrafPril\PRIL_35.jpg' | LANG=C awk -F ';' '{print length ($1)}'
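A shorter check of the same idea, as a rough sketch. The path here is a made-up sample, not the one from the question, and LC_ALL=C is used instead of LANG=C because LC_ALL takes precedence if it happens to be set in the environment. Note that in the C locale substr() also works on bytes, so an offset that lands inside Cyrillic text may cut a character in half.

```shell
# Hypothetical sample path; any string with non-ASCII characters works.
path='\\prj\отчеты\PRIL_35.jpg'

# Byte length according to wc.
wc_bytes=$(printf '%s' "$path" | wc -c | tr -d ' ')

# In the C locale awk treats every byte as one character,
# so length() agrees with wc -c.
awk_bytes=$(printf '%s' "$path" | LC_ALL=C awk '{ print length($0) }')

# substr() in the C locale slices bytes; bytes 3-5 here are plain ASCII.
part=$(printf '%s' "$path" | LC_ALL=C awk '{ print substr($0, 3, 3) }')

echo "wc=$wc_bytes awk=$awk_bytes part=$part"
```

If substr() has to work in characters instead, the locale would need to stay UTF-8 and the offsets be counted in characters, as the gawk man page quoted above describes.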
Upvotes: 0