Taygrim

Reputation: 409

Checking a character to be a newline

How to check whether a character is a newline character in any encoding in C?

I have a task to write my own wc program. If I just use if (s[i] == '\n'), it gives a different answer than the original wc when I run it on its own executable.
Here is the code:

typedef struct
{
    int newline;
    int word;
    int byte;
} info;

info count(int descr)
{
    info kol;
    kol.newline = 0;
    kol.word = 0;
    kol.byte = 0;

    int len = 512;
    char s[512];
    int n;

    errno = 0;
    int flag1 = 1;
    int flag2 = 1;
    while(n = read(descr, s, len))
    {
        if(n == -1)
            error("Error while reading.", errno);

        errno = 0; 

        kol.byte+=n;
        for(int i=0; i<n; i++)
        {
            if(flag1)
            {
                kol.newline++;
                flag1 = 0;
            }

            if(isblank((unsigned char)s[i]) || s[i] == '\n')
                flag2 = 1;
            else
            {
                if(flag2)
                {
                    kol.word++;
                    flag2 = 0;
                }
            }
            if(s[i] == '\n')
                flag1 = 1;
        }
    }
    return kol;
}  

It works fine for all text files, but when I run it on the file I get after compiling the program itself, it doesn't give the answer wc gives.

Upvotes: 18

Views: 87354

Answers (3)

Ale

Reputation: 1839

As far as I know, there is no standard function for this like the isXXXXX() ones. The closest is isspace(), but it is also true for other characters (space, tab, form feed...). Simply comparing to '\n' should solve your problem; depending on what you consider to be a newline character, you might also want to check for '\r' (carriage return). The Unix line separator is '\n'; Mac OS before OS X used '\r' (now '\n' is more common, but '\r' is sometimes still used by some applications, e.g. MS Office); DOS/Windows use the "\r\n" sequence.
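
To illustrate the three conventions, here is a minimal sketch that counts line terminators in a buffer, treating "\r\n" as a single terminator and a lone '\r' or '\n' as one each. The function name count_line_endings is made up for this example, and the sketch assumes the whole input is in one buffer.

```c
#include <stddef.h>

/* Hypothetical helper: counts line terminators in buf[0..n), treating
   "\r\n" as one terminator and a lone '\r' or '\n' as one each. */
size_t count_line_endings(const char *buf, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\r') {
            count++;
            /* Skip the '\n' of a "\r\n" pair so it is not counted twice. */
            if (i + 1 < n && buf[i + 1] == '\n')
                i++;
        } else if (buf[i] == '\n') {
            count++;
        }
    }
    return count;
}
```

If the input is read in chunks, a "\r\n" pair split across two chunks would be counted twice by this sketch, so a real program would need to remember whether the previous chunk ended in '\r'.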

Upvotes: 1

Keith Thompson

Reputation: 263617

The way to check whether a character s[i] is a newline character is simply:

if (s[i] == '\n')

If you're reading from a file that's been opened in text mode (including stdin), then whatever representation the underlying system uses to mark the end of a line will be translated to a single '\n' character.

You say you're trying to write your own wc program, and by comparing to '\n' you're getting different results than the system's wc. You haven't told us enough to guess why that's happening. Show us your code and tell us exactly what's happening.

You might run into problems if you're reading a file that's encoded differently -- say, trying to read a Unix-format text file on a Windows system. But then wc would have the same problem.
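
As a minimal sketch, assuming the goal is to match wc -l (which POSIX specifies as counting newline characters), the comparison above can be applied to every byte of a buffer. The name count_newlines is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: number of '\n' bytes in buf[0..n). */
size_t count_newlines(const char *buf, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (buf[i] == '\n')
            count++;
    }
    return count;
}
```

Because this tallies '\n' bytes rather than line starts, a final line with no trailing '\n' adds nothing to the total. That is one place where a hand-written counter can diverge from wc on input such as binary files.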

Upvotes: 17

Dave

Reputation: 46349

There are several newline characters in ASCII and Unicode.

The most famous are \r and \n, from ASCII. Technically these are carriage return and line feed. Windows uses both together, \r\n (technically carriage return means go to column 0 and line feed means go to the next line, but nothing I know of obeys that in practice); Unix uses just \n. Some (uncommon) OSs use just \r.

Most apps stop there, and don't suffer for it. What follows is more theoretical.

Unicode complicates things. U+000D and U+000A are identical to \r and \n (same binary representation in UTF-8). Then there's U+0085 "next line", U+2028 "line separator" and U+2029 "paragraph separator". You can also check vertical tab (U+000B) if you want to check everything. See here: http://en.wikipedia.org/wiki/Newline#Unicode
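
For illustration only, here is a rough sketch of how those code points could be recognised in a byte buffer, assuming the input is UTF-8. In UTF-8, U+0085 is the bytes C2 85, U+2028 is E2 80 A8 and U+2029 is E2 80 A9. The function name utf8_newline_len is made up for this example.

```c
#include <stddef.h>

/* Hypothetical helper: if one of the line terminators listed above starts
   at p (with n bytes available), return its length in bytes; otherwise
   return 0. Assumes the input is UTF-8. */
size_t utf8_newline_len(const unsigned char *p, size_t n)
{
    if (n == 0)
        return 0;
    switch (p[0]) {
    case 0x0A:      /* U+000A line feed */
    case 0x0B:      /* U+000B vertical tab */
        return 1;
    case 0x0D:      /* U+000D carriage return, possibly followed by LF */
        return (n >= 2 && p[1] == 0x0A) ? 2 : 1;
    case 0xC2:      /* U+0085 next line is encoded as C2 85 */
        return (n >= 2 && p[1] == 0x85) ? 2 : 0;
    case 0xE2:      /* U+2028 and U+2029 are E2 80 A8 and E2 80 A9 */
        if (n >= 3 && p[1] == 0x80 && (p[2] == 0xA8 || p[2] == 0xA9))
            return 3;
        return 0;
    default:
        return 0;
    }
}
```

A caller would advance by the returned length when it is non-zero and by one byte otherwise. In other encodings (UTF-16, Latin-1, EBCDIC) the byte patterns are different, so this check does not carry over as is.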

Upvotes: 6
