Reputation: 1773
I am new to C++. I have to find out the encoding of a file passed in by the user, but I don't know how to check a file's encoding. What I need is to print whether the file is Unicode, ANSI, Unicode big endian, or UTF-8. I have searched a lot but could not find a solution. So far I have only opened the file:
#include "stdafx.h"
#include <iostream>
#include <stdio.h>
#include<conio.h>
#include <fstream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
fstream f;
f.open("c:\\abc.txt", fstream::in | fstream::out); /* Read-write. The backslash must be escaped. */
getch();
return 0;
}
So can anyone please show me code that does this?
What if I am accessing a Notepad file?
Thanks in advance.
Upvotes: 3
Views: 7210
Reputation: 1773
I have found a way to detect whether a Notepad file is Unicode, Unicode big endian, UTF-8, or plain ANSI:
I found that when I save a file in Notepad with one of the Unicode encodings, it stores a Byte Order Mark (BOM) at the start of the file. So I decided to use it, as suggested earlier in this question.
First I read the first byte of the file. I already knew that this byte is 254 (0xFE) for Unicode big endian, 255 (0xFF) for Unicode little endian, and 239 (0xEF) for UTF-8; anything else I treat as ANSI.
Here is the code:
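#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("c:\\abc.txt", "rb");
    int c;

    if (fp == NULL)
    {
        perror("fopen");
        return 1;
    }

    /* Read only the first byte; it is the first byte of the BOM, if any. */
    c = fgetc(fp);
    fclose(fp);
    printf("%d\n", c);

    if (c == 254)        /* 0xFE: first byte of the UTF-16 big endian BOM */
        printf("Unicode Big Endian File\n");
    else if (c == 255)   /* 0xFF: first byte of the UTF-16 little endian BOM */
        printf("Unicode Little Endian File\n");
    else if (c == 239)   /* 0xEF: first byte of the UTF-8 BOM */
        printf("UTF8 file\n");
    else                 /* no BOM found (also reached for an empty file) */
        printf("ANSI File\n");

    getchar();
    return 0;
}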
This worked fine for me. I hope it works for others too.
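Checking only the first byte can misfire, because an ANSI file may happen to start with byte 239, 254 or 255. A slightly stricter variant compares the complete BOM sequences (EF BB BF for UTF-8, FE FF for UTF-16 big endian, FF FE for UTF-16 little endian). The following is only a sketch: the names detect_bom and detect_file_bom are made up for illustration, and it still reports nothing useful for files saved without a BOM.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: classify a buffer by its leading byte order mark.
std::string detect_bom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16 big endian";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16 little endian";
    // No BOM: could be ANSI, ASCII, or Unicode saved without a BOM.
    return "no BOM";
}

// Hypothetical wrapper: read up to the first three bytes of a file
// in binary mode and classify them.
std::string detect_file_bom(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        return "cannot open file";
    char buf[3] = {0, 0, 0};
    in.read(buf, sizeof buf);
    std::size_t n = static_cast<std::size_t>(in.gcount());
    return detect_bom(reinterpret_cast<const unsigned char*>(buf), n);
}
```

For example, detect_file_bom("c:\\abc.txt") would return one of the strings above. Note that a UTF-32 little endian BOM (FF FE 00 00) would be reported as UTF-16 little endian by this sketch.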
Upvotes: 2
Reputation: 21
Open your file with Notepad++ and go to the Encoding menu at the top to see the encoding of the file. See here.
Upvotes: -1
Reputation: 50667
As discussed here, the only thing you can do is guess
in the best order which is most likely to throw out invalid matches.
You should check, in this order:
Upvotes: 3
Reputation: 2723
Files generally indicate their encoding with a file header.
And as others suggested you can never be sure what encoding a file is really using.
Follow these links to get a general idea :
Using Byte Order Marks
FILE SIGNATURES TABLE
Upvotes: 1
Reputation: 12212
You cannot know for certain what encoding a text file has. One way to approach it is to look for the BOM at the beginning of the file, which would tell you that the text is in Unicode. However, the BOM is not mandatory, so you cannot rely on it to differentiate Unicode from other encodings.
A very common way to present this problem is that there is no such thing as plain text.
I'm Spanish, and here you can easily find text files in 7-bit ASCII, extended ASCII, ISO-8859-1 (aka Latin 1, which includes many common extra characters needed for Western Europe), and also UTF in its various flavours.
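When there is no BOM, about the best you can do is guess from the bytes themselves. As a rough sketch (the function names is_ascii and looks_like_utf8 are hypothetical, not from any library): a file with no byte above 0x7F is plain ASCII, and a file whose high bytes all form well-shaped UTF-8 sequences is probably UTF-8, since text in ISO-8859-1 with accented letters rarely passes that test by accident. This is a heuristic only; it checks the shape of the sequences and does not reject every ill-formed case.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every byte is 7-bit ASCII.
bool is_ascii(const std::string& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i)
        if (static_cast<unsigned char>(bytes[i]) > 0x7F)
            return false;
    return true;
}

// Hypothetical helper: structural UTF-8 check. Each lead byte must be
// followed by the expected number of continuation bytes (10xxxxxx).
bool looks_like_utf8(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;
        if (b <= 0x7F)                   extra = 0;  // ASCII
        else if (b >= 0xC2 && b <= 0xDF) extra = 1;  // 2-byte sequence
        else if (b >= 0xE0 && b <= 0xEF) extra = 2;  // 3-byte sequence
        else if (b >= 0xF0 && b <= 0xF4) extra = 3;  // 4-byte sequence
        else return false;                           // invalid lead byte
        if (i + extra >= bytes.size())
            return false;                            // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;                        // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}
```

With the Spanish word "niño", the UTF-8 bytes 6E 69 C3 B1 6F pass looks_like_utf8, while the ISO-8859-1 bytes 6E 69 F1 6F do not. Both contain a byte above 0x7F, so neither is plain ASCII.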
Hope this somehow helps.
Upvotes: 1