Anonymous

Reputation: 1773

How to find the encoding of a text file in C++?

I am new to C++. I have to find out the encoding of a file that the user passes in, but I do not know how to check a file's encoding. What I need is to print whether the file is Unicode, ANSI, Unicode big endian, or UTF-8. I have searched a lot but have been unable to find a solution. All I have done so far is open the file:

#include "stdafx.h"
#include <iostream>
#include <stdio.h>
#include <conio.h>
#include <fstream>
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    fstream f;
    /* Backslashes must be escaped in a string literal; "\a" would be the alert character. */
    f.open("c:\\abc.txt", fstream::in | fstream::out); /* Read-write. */

    getch();
    return 0;
}

So, can anyone please show me code that solves this?

What if I am accessing a Notepad file?

Thanks in advance.

Upvotes: 3

Views: 7210

Answers (6)

Anonymous

Reputation: 1773

Here is a way I found to detect whether a Notepad file is Unicode, Unicode big endian, UTF-8, or plain ANSI.

I found that when I save a file in Notepad, it stores a byte order mark (BOM) at the start of the file by default. So I decided to use it, as suggested earlier in this question.

First I read the first few bytes of the file. I already knew that:

  1. If the file is Unicode (UTF-16), its first two bytes are FE FF (254 255 in decimal) for big endian, or FF FE (255 254 in decimal) for little endian.
  2. If the file is UTF-8, its first three bytes are EF BB BF (239 187 191 in decimal).

Here is the code:

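#include <stdio.h>

int main(void)
{
    unsigned char bom[3] = {0, 0, 0};
    size_t n = 0;
    /* Open in binary mode so the bytes are read exactly as stored. */
    FILE *fp = fopen("c:\\abc.txt", "rb");

    if (fp == NULL)
    {
        perror("fopen");
        return 1;
    }

    /* Read up to the first three bytes, where a BOM would be stored. */
    n = fread(bom, 1, sizeof bom, fp);
    fclose(fp);

    if (n >= 3 && bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF)
    {
        printf("UTF8 file\n");                  /* 239 187 191 */
    }
    else if (n >= 2 && bom[0] == 0xFE && bom[1] == 0xFF)
    {
        printf("Unicode Big Endian File\n");    /* 254 255 */
    }
    else if (n >= 2 && bom[0] == 0xFF && bom[1] == 0xFE)
    {
        printf("Unicode Little Endian File\n"); /* 255 254 */
    }
    else
    {
        printf("ANSI File\n");                  /* no BOM found */
    }

    getchar();

    return 0;
}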

This worked fine for me. I hope it works for others too.

Upvotes: 2

Shah_MRI

Reputation: 21

Open your file with Notepad++ and go to the Encoding menu at the top to see the file's encoding. See here.

Upvotes: -1

herohuyongtao

Reputation: 50667

As discussed here, the only thing you can do is guess, testing the candidate encodings in the order most likely to rule out invalid matches.

You should check, in this order:

  • Is there a UTF-16 BOM at the beginning? Then it's probably UTF-16. Use the BOM as indicator whether it's big endian or little endian, then check the rest of the file whether it conforms.
  • Is there a UTF-8 BOM at the beginning? Then it's probably UTF-8. Check the rest of the file.
  • If the above didn't result in a positive match, check if the entire file is valid UTF-8. If it is, it's probably UTF-8.
  • If the above didn't result in a positive match, it's probably ANSI.
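
The order above can be sketched in C++ roughly as follows. This is only a minimal illustration, not a definitive detector: the names `Encoding`, `is_valid_utf8`, and `guess_encoding` are made up for this example, the whole file is assumed to have been read into a `std::string` in binary mode already, and the step "check the rest of the file whether it conforms" is left out for UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for this sketch; they do not come from any library.
enum class Encoding { Utf16BigEndian, Utf16LittleEndian, Utf8, Ansi };

// Returns true if the bytes of `s` from offset `pos` onward are well-formed UTF-8.
bool is_valid_utf8(const std::string& s, std::size_t pos = 0)
{
    const std::size_t n = s.size();
    auto byte = [&](std::size_t i) { return static_cast<unsigned char>(s[i]); };
    while (pos < n) {
        const unsigned char lead = byte(pos);
        if (lead <= 0x7F) { ++pos; continue; }   // single-byte (ASCII)

        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;      // allowed range for the second byte
        if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3;
            if (lead == 0xE0) lo = 0xA0;         // reject overlong forms
            if (lead == 0xED) hi = 0x9F;         // reject UTF-16 surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4;
            if (lead == 0xF0) lo = 0x90;         // reject overlong forms
            if (lead == 0xF4) hi = 0x8F;         // reject values above U+10FFFF
        } else {
            return false;                        // stray continuation or invalid lead byte
        }

        if (len > n - pos) return false;         // sequence truncated at end of input
        if (byte(pos + 1) < lo || byte(pos + 1) > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            if (byte(pos + k) < 0x80 || byte(pos + k) > 0xBF) return false;
        }
        pos += len;
    }
    return true;
}

// Applies the checks in the order listed above and returns the best guess.
Encoding guess_encoding(const std::string& bytes)
{
    auto at = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };

    // 1. UTF-16 BOM: FE FF is big endian, FF FE is little endian.
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF) return Encoding::Utf16BigEndian;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE) return Encoding::Utf16LittleEndian;

    // 2. UTF-8 BOM (EF BB BF), with the rest of the file checked as UTF-8.
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF &&
        is_valid_utf8(bytes, 3)) {
        return Encoding::Utf8;
    }

    // 3. No BOM: the whole file must be valid UTF-8.
    if (is_valid_utf8(bytes)) return Encoding::Utf8;

    // 4. Otherwise fall back to ANSI.
    return Encoding::Ansi;
}
```

Note that a file containing only 7-bit ASCII is also valid UTF-8, so this sketch reports it as UTF-8; the result is still a guess, as the answer says.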

Upvotes: 3

Aseem Goyal

Reputation: 2723

Files generally indicate their encoding with a file header.
And, as others suggested, you can never be sure which encoding a file is really using.

Follow these links to get a general idea:
Using Byte Order Marks
FILE SIGNATURES TABLE

Upvotes: 1

Baltasarq

Reputation: 12212

You cannot know for certain what encoding a text file has. One way to guess would be to look for the BOM at the beginning of the file, which would tell you whether the text is in Unicode. However, the BOM is not mandatory, so you cannot rely on it to differentiate Unicode from other encodings.

A very common way to present this problem is that there is no such thing as plain text.

I'm Spanish, and here you can easily find text files in 7-bit ASCII, extended ASCII, ISO-8859-1 (aka Latin 1, which includes many common extra characters needed for Western Europe), and also UTF in its various flavours.
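
To illustrate why those encodings are hard to tell apart, here is a small C++ sketch that only checks whether every byte is in the 7-bit ASCII range. The helper name `is_seven_bit_ascii` is made up for this example. A `true` result only says that no byte has its high bit set; a `false` result says nothing about which encoding is in use. For instance, the single byte 0xF1 is "ñ" in ISO-8859-1, but on its own it is not a complete UTF-8 sequence.

```cpp
#include <string>

// Hypothetical helper for this sketch: true if every byte is in the 7-bit ASCII range.
bool is_seven_bit_ascii(const std::string& bytes)
{
    for (unsigned char b : bytes) {
        if (b > 0x7F) return false;  // high bit set: some 8-bit or multi-byte encoding
    }
    return true;
}
```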

Hope this somehow helps.

Upvotes: 1

oleksii

Reputation: 35905

You cannot.

The best you can do is guess it, or save the encoding as part of your file structure (if you can).

Upvotes: 5
