Reputation: 409
How to check whether a character is a newline character in any encoding in C?
I have a task to write my own wc program. If I just use if (s[i] == '\n') to detect newlines,
it gives a different answer from the original wc when I run both on the program's own executable.
Here is the code:
typedef struct
{
    int newline;
    int word;
    int byte;
} info;

info count(int descr)
{
    info kol;
    kol.newline = 0;
    kol.word = 0;
    kol.byte = 0;
    int len = 512;
    char s[512];
    int n;
    errno = 0;
    int flag1 = 1;
    int flag2 = 1;
    while ((n = read(descr, s, len)) != 0)
    {
        if (n == -1)
            error("Error while reading.", errno);
        errno = 0;
        kol.byte += n;
        for (int i = 0; i < n; i++)
        {
            if (flag1)
            {
                kol.newline++;
                flag1 = 0;
            }
            if (isblank(s[i]) || s[i] == '\n')
                flag2 = 1;
            else
            {
                if (flag2)
                {
                    kol.word++;
                    flag2 = 0;
                }
            }
            if (s[i] == '\n')
                flag1 = 1;
        }
    }
    return kol;
}
It works fine for all text files, but when I run it on the executable I get from compiling the program itself, it doesn't give the same answer as wc.
Upvotes: 18
Views: 87354
Reputation: 1839
As far as I know, there is no standard function for this among the isXXXXX() ones. The closest is isspace(), which is also true for other characters (space, tab, form feed...). Simply comparing to '\n' should solve your problem; depending on what you consider to be a newline character, you might also want to check for '\r' (carriage return). The UNIX standard line separator is '\n'. Mac (before OS X) used '\r' (now '\n' is more common, but '\r' is sometimes still used by some applications, e.g. MS Office). DOS/Windows use the "\r\n" sequence.
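If you do decide to treat all three conventions as line breaks, a minimal sketch could look like the following. The function name count_line_breaks is hypothetical, and note that this is a different definition from the one wc uses, since wc only counts '\n'.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts line terminators in buf[0..n),
 * treating "\r\n", a lone '\r' and a lone '\n' as one terminator each. */
size_t count_line_breaks(const char *buf, size_t n)
{
    assert(buf != NULL || n == 0);

    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\r') {
            /* Skip the '\n' of a "\r\n" pair so it is not counted twice. */
            if (i + 1 < n && buf[i + 1] == '\n')
                i++;
            count++;
        } else if (buf[i] == '\n') {
            count++;
        }
    }
    return count;
}
```

One limitation of this sketch: if a "\r\n" pair is split across two separate buffers (for example two successive read() calls), it is counted twice, so a real program would need to remember whether the previous buffer ended in '\r'.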
Upvotes: 1
Reputation: 263617
The way to check whether a character s[i] is a newline character is simply:
    if (s[i] == '\n')
If you're reading from a file that's been opened in text mode (including stdin), then whatever representation the underlying system uses to mark the end of a line will be translated to a single '\n' character.
You say you're trying to write your own wc program, and by comparing to '\n' you're getting different results than the system's wc. You haven't told us enough to guess why that's happening. Show us your code and tell us exactly what's happening.
You might run into problems if you're reading a file that's encoded differently, say, trying to read a Unix-format text file on a Windows system. But then wc would have the same problem.
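For comparison, here is a minimal sketch of counting the way POSIX describes wc: one line per '\n' byte, and one word per maximal run of non-whitespace characters. The names wc_state and wc_feed are hypothetical, and isspace() is used on the assumption of the "C" locale; an actual wc may apply locale-dependent rules, so results on binary input can still differ.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical running totals; in_word carries state across buffers. */
typedef struct {
    size_t lines;
    size_t words;
    size_t bytes;
    int in_word;
} wc_state;

/* Feed the next n bytes of input into the running totals. */
void wc_feed(wc_state *st, const char *buf, size_t n)
{
    assert(st != NULL);

    st->bytes += n;
    for (size_t i = 0; i < n; i++) {
        /* isspace() requires a value representable as unsigned char. */
        unsigned char c = (unsigned char)buf[i];

        if (c == '\n')
            st->lines++;            /* count newline bytes, not lines started */

        if (isspace(c)) {
            st->in_word = 0;        /* '\r', '\f', '\v' also end a word */
        } else if (!st->in_word) {
            st->in_word = 1;
            st->words++;
        }
    }
}
```

Start from a zero-initialized wc_state and call wc_feed() once per buffer returned by read(). Two things differ from a loop that counts a line whenever one starts and splits words with isblank(): a final line without a trailing '\n' is not counted, and bytes such as '\r', '\f' and '\v', which are common in binary files, act as word separators.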
Upvotes: 17
Reputation: 46349
There are several newline characters in ASCII and Unicode.
The most famous are \r and \n, from ASCII. Technically these are carriage return and line feed. Windows uses both together, \r\n (technically carriage return means go to column 0 and line feed means go to the next line, but nothing I know of obeys that in practice); Unix uses just \n. Some (not common) OSs use just \r.
Most apps stop there, and don't suffer for it. What follows is more theoretical.
Unicode complicates things. U+000D and U+000A are identical to \r and \n (same binary representation in UTF-8). Then there's U+0085 "next line", U+2028 "line separator" and U+2029 "paragraph separator". You can also check vertical tab (U+000B) if you want to check everything. See here: http://en.wikipedia.org/wiki/Newline#Unicode
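As a rough sketch of what checking for all of these could look like, assuming the input is UTF-8: U+0085 is encoded as the bytes C2 85, U+2028 as E2 80 A8, and U+2029 as E2 80 A9. The function name utf8_line_break_len is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, assuming UTF-8 input: returns the length in bytes
 * of the line-break sequence at the start of p[0..n), or 0 if there is none. */
size_t utf8_line_break_len(const unsigned char *p, size_t n)
{
    assert(p != NULL || n == 0);

    if (n == 0)
        return 0;

    switch (p[0]) {
    case 0x0A:                      /* U+000A line feed */
    case 0x0B:                      /* U+000B vertical tab */
        return 1;
    case 0x0D:                      /* U+000D, alone or as part of "\r\n" */
        return (n >= 2 && p[1] == 0x0A) ? 2 : 1;
    case 0xC2:                      /* U+0085 next line */
        return (n >= 2 && p[1] == 0x85) ? 2 : 0;
    case 0xE2:                      /* U+2028 / U+2029 separators */
        return (n >= 3 && p[1] == 0x80 &&
                (p[2] == 0xA8 || p[2] == 0xA9)) ? 3 : 0;
    default:
        return 0;
    }
}
```

A caller would walk the buffer, advancing by the returned length when it is non-zero and by one byte otherwise. This only makes sense for text known to be UTF-8; on other encodings or on binary data the multi-byte checks can match unrelated bytes.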
Upvotes: 6