zaid

Reputation: 6339

C++ Text File, Chinese characters

I have a C++ project that is supposed to add <item> to the beginning of every line and </item> to the end of every line. This works fine with normal English text, but it does not work with a Chinese text file. I normally use .txt files, but for this one I have to use .rtf to save the Chinese text. After I run my code, the output is gibberish. Here's an example:

{\rtf1\adeflang1025\ansi\ansicpg1252\uc1\adeff31507\deff0\stshfdbch31506\stshfloch31506\stshfhich31506\stshfbi31507\deflang1033\deflangfe1033\themelang1033\themelangfe0\themelangcs0{\fonttbl{\f2\fbidi \fmodern\fcharset0\fprq1{*\panose 02070309020205020404}Courier New;}

Code:

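#include <cstdlib>   // exit
#include <fstream>   // ifstream, ofstream
#include <string>    // string, getline

using namespace std;

int main()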
{
    ifstream in;
    ofstream out;
    string lineT, newlineT;

    in.open("rawquote.rtf");
    if(in.fail())
       exit(1);
    out.open("itemisedQuote.rtf");
    do
    {
        getline(in,lineT,'\n');
        newlineT += "<item>";
        newlineT += lineT;
        newlineT += "</item>";
        if (lineT.length() >5)
        {
            out<<newlineT<<'\n';
        }
        newlineT = "";
        lineT = "";
    } while(!in.eof());
    return 0;
}

Upvotes: 0

Views: 2881

Answers (5)

Hans Passant

Reputation: 941397

It's kind of a miracle that this works for non-Chinese text. "\n" is not the line separator in RTF, "\par" is. The odds that more damage is done to the RTF header are certainly greater for Chinese.

C++ is not the best language to tackle this. It is a trivial five-minute program in C#, as long as the file doesn't get too large:

using System;
using System.Windows.Forms;   // Requires a reference to System.Windows.Forms

class Program {
    static void Main(string[] args) {
        var rtb = new RichTextBox();
        rtb.LoadFile(args[0], RichTextBoxStreamType.RichText);
        var lines = rtb.Lines;
        for (int ix = 0; ix < lines.Length; ++ix) {
            lines[ix] = "<item>" + lines[ix] + "</item>";
        }
        rtb.Lines = lines;
        rtb.SaveFile(args[0], RichTextBoxStreamType.RichText);
    }
}

If C++ is a hard requirement then you'll have to find an RTF parser.

Upvotes: 1

Dave Mateer

Reputation: 17946

If I'm understanding the objective of this code, your solution is not going to work. A line break in an RTF document does not correspond to a line break in the visible text.
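To make that mismatch concrete, here is a rough sketch that counts RTF paragraph breaks (the \par control word) and physical newlines separately. The helper names count_par and count_newlines are made up for illustration, and the scan is deliberately naive: it ignores groups, escaped backslashes, and binary data, so it is not a substitute for a real RTF parser.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts occurrences of the RTF control word \par.
// Naive on purpose: no group tracking, no handling of "\\" escapes.
std::size_t count_par(const std::string& rtf)
{
    const std::string word = "\\par";
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = rtf.find(word, pos)) != std::string::npos) {
        const std::size_t next = pos + word.size();
        // A control word runs until the first non-letter, so "\pard"
        // must not be counted as "\par".
        const bool longer_word =
            next < rtf.size() &&
            std::isalpha(static_cast<unsigned char>(rtf[next]));
        if (!longer_word)
            ++count;
        pos = next;
    }
    return count;
}

// Counts physical line breaks in the raw file text, for comparison.
std::size_t count_newlines(const std::string& text)
{
    std::size_t count = 0;
    for (char c : text)
        if (c == '\n')
            ++count;
    return count;
}
```

On a typical .rtf file the two counts differ, which is why wrapping each physical line in <item> tags lands in the middle of the markup instead of around the visible lines.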

If you can't just use plain text (Chinese characters are not a problem with a valid encoding), take a look at the RTF spec. You'll discover that it is a nightmare. So your best bet is probably a third-party library that can parse RTF and read it "line" by "line". I have never looked for such a library, so I don't have any suggestions off the top of my head, but I'm sure they are out there.

Upvotes: 0

Mario

Reputation: 36487

You can't process RTF the same way as plain text: your code ignores the formatting tags, so it may well break the markup.

Try saving your Chinese text as a plain text file encoded as UTF-8 (without a BOM) and your code should work. Splitting on '\n' is safe there: every byte of a multi-byte UTF-8 character is 0x80 or above, so a line break byte can never appear inside one. If you need to work with individual characters rather than whole lines, you should do a real UTF-8 conversion and read the file using wide chars instead of regular chars (as Chan suggested), which is a little bit tricky in C++.

Upvotes: 1

roxrook

Reputation: 13853

I think you should use wide characters ('wchar_t', e.g. std::wstring) for the string instead of regular 'char'.
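For what it's worth, here is a sketch of what going from bytes to whole characters could look like. It swaps in char32_t for wchar_t, because wchar_t is only 16 bits wide on Windows, and it assumes the input is valid UTF-8 (malformed input is not detected). The function name decode_utf8 is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 bytes into one char32_t per code point.
// Assumes the input is valid UTF-8; malformed input is not detected.
std::u32string decode_utf8(const std::string& bytes)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (; extra > 0 && i < bytes.size(); --extra, ++i) {
            // Each continuation byte contributes its low six bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

This only matters when the program has to look at individual characters. For wrapping whole lines in tags, the byte string can be passed through untouched.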

Upvotes: 0

Nim

Reputation: 33655

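That looks like RTF markup, which makes sense since you say this is an .rtf file.

Basically, if you dump the raw contents of that file when you open it, you'll see that it looks exactly like that.

Also, you should revisit your loop and test the result of getline itself instead of checking eof() afterwards: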

std::string line;
while (std::getline(in, line, '\n'))
{
  // do stuff here; the loop condition checks that a line was actually read
  out << "<item>" << line << "</item>" << std::endl;
}

Upvotes: 1
