migajek

Reputation: 8614

Delphi, charset detection ([Uni]SynEdit) - Utf8Decode problem

I'm using Unicode SynEdit, which (in theory) has basic file/stream encoding detection. It worked fine until I tried opening a file generated by my PHP script. UniSynEdit detects that file as UTF-8 without a BOM. Unfortunately, it doesn't open: the loaded string is empty. I debugged it, and the problem seems to be the Utf8Decode function, which fails for some reason and returns an empty string. I've also checked the file with a hex editor, and the detection looks right: it has no BOM, all the ordinary characters are encoded as a single byte, while the Polish letters I had in the file (like "ł") take two bytes...

What could be wrong, and how can I prevent this? I'd rather load the file with the wrong encoding than not load it at all...

Upvotes: 1

Views: 3851

Answers (1)

mghie

Reputation: 32334

If you really want to load files that are not correctly UTF-8 encoded, then you need a function that does not return an empty result for a string containing invalid byte sequences, but instead substitutes a replacement character for them. See the Wikipedia entry on UTF-8, in particular the section on "Invalid byte sequences".

Unfortunately, UTF8Decode() in Delphi 2009 (I don't have Delphi 7 to check there) calls MultiByteToWideChar(CP_UTF8, ...) internally, which returns an error on invalid byte sequences.
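If the error comes from the API being asked to reject invalid input (the MB_ERR_INVALID_CHARS flag), one thing to try is calling it directly with the flags set to 0. The following is only a sketch under assumptions: it targets Delphi 2009 or later (RawByteString and UnicodeString do not exist in Delphi 7), `Utf8ToUnicodeViaApi` is a made-up name, and it relies on the documented Windows behaviour that, from Vista on, invalid sequences are replaced with U+FFFD when that flag is not set (older Windows versions silently drop them instead).

```delphi
program Utf8ApiDemo;

{$APPTYPE CONSOLE}

uses
  Windows;

// Hypothetical helper: converts UTF-8 bytes by calling the Windows API
// directly with dwFlags = 0, i.e. without MB_ERR_INVALID_CHARS.
function Utf8ToUnicodeViaApi(const S: RawByteString): UnicodeString;
var
  Needed: Integer;
begin
  Result := '';
  if Length(S) = 0 then
    Exit;
  // First call: ask how many UTF-16 code units the result needs.
  Needed := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), nil, 0);
  if Needed <= 0 then
    Exit;
  SetLength(Result, Needed);
  // Second call: perform the conversion into the allocated buffer.
  MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S),
    PWideChar(Result), Needed);
end;

var
  Raw: RawByteString;
begin
  // 'a' followed by $FF, a byte that can never occur in valid UTF-8.
  SetLength(Raw, 2);
  Raw[1] := AnsiChar($61);
  Raw[2] := AnsiChar($FF);
  Writeln(Length(Utf8ToUnicodeViaApi(Raw)));
end.
```

What happens to the bad byte depends on the Windows version, so it is worth checking the result on every system you need to support.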

What you'd have to do is use an alternative decoding function. Maybe one of the third-party Delphi libraries has its own decoding functions. Maybe you could use one of the linked libraries here. If all else fails you could write your own, maybe based on this code from the Unicode consortium.
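For the "write your own" route, a minimal sketch of a lenient decoder could look like the following. It is an illustration, not the Unicode consortium code and not part of SynEdit or the RTL: `Utf8DecodeLenient` is a hypothetical name, the types assume Delphi 2009 or later (in Delphi 7 you would have to use AnsiString and WideString instead), and the policy of emitting one U+FFFD per offending byte is just one of several reasonable choices.

```delphi
program LenientUtf8Demo;

{$APPTYPE CONSOLE}

// Hypothetical helper: decodes UTF-8 and writes U+FFFD for every byte that
// does not start a valid sequence, instead of failing on the whole string.
function Utf8DecodeLenient(const S: RawByteString): UnicodeString;
const
  Replacement = WideChar($FFFD);
  // Smallest code point allowed with 1, 2 or 3 continuation bytes.
  MinCodePoint: array[1..3] of Cardinal = ($80, $800, $10000);
var
  I, K, Len, OutLen, Trail: Integer;
  B: Byte;
  CP: Cardinal;
  Ok: Boolean;
begin
  Len := Length(S);
  SetLength(Result, Len); // never more UTF-16 units than input bytes
  OutLen := 0;
  I := 1;
  while I <= Len do
  begin
    B := Byte(S[I]);
    // Classify the lead byte and keep its payload bits.
    if B < $80 then begin Trail := 0; CP := B; end
    else if (B and $E0) = $C0 then begin Trail := 1; CP := B and $1F; end
    else if (B and $F0) = $E0 then begin Trail := 2; CP := B and $0F; end
    else if (B and $F8) = $F0 then begin Trail := 3; CP := B and $07; end
    else begin Trail := -1; CP := 0; end; // stray continuation or bad lead

    // Every continuation byte must exist and look like 10xxxxxx.
    Ok := (Trail >= 0) and (I + Trail <= Len);
    K := 1;
    while Ok and (K <= Trail) do
    begin
      B := Byte(S[I + K]);
      Ok := (B and $C0) = $80;
      CP := (CP shl 6) or (B and $3F);
      Inc(K);
    end;
    // Reject overlong forms, surrogates and values above U+10FFFF.
    if Ok and (Trail > 0) then
      Ok := (CP >= MinCodePoint[Trail]) and (CP <= $10FFFF) and
        ((CP < $D800) or (CP > $DFFF));

    if not Ok then
    begin
      Inc(OutLen);
      Result[OutLen] := Replacement;
      Inc(I); // resynchronise on the next byte
    end
    else
    begin
      if CP < $10000 then
      begin
        Inc(OutLen);
        Result[OutLen] := WideChar(CP);
      end
      else
      begin
        // Encode as a UTF-16 surrogate pair.
        Dec(CP, $10000);
        Inc(OutLen);
        Result[OutLen] := WideChar($D800 or (CP shr 10));
        Inc(OutLen);
        Result[OutLen] := WideChar($DC00 or (CP and $3FF));
      end;
      Inc(I, Trail + 1);
    end;
  end;
  SetLength(Result, OutLen);
end;

var
  Raw: RawByteString;
  Decoded: UnicodeString;
begin
  // Bytes: 'a', then C5 82 (the letter "ł"), then a stray $FF.
  SetLength(Raw, 4);
  Raw[1] := AnsiChar($61);
  Raw[2] := AnsiChar($C5);
  Raw[3] := AnsiChar($82);
  Raw[4] := AnsiChar($FF);
  Decoded := Utf8DecodeLenient(Raw);
  Writeln(Length(Decoded)); // 3: 'a', "ł" and one U+FFFD
end.
```

With something like this the file always loads, and any damaged spots show up as visible replacement characters rather than an empty editor.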

Upvotes: 3
