Evan Parsons

Reputation: 1209

Reading a "string in little-endian UTF-16 encoding" with BinaryReader

I am following the specification for this file format: https://github.com/rouault/dump_gdbtable/wiki/FGDB-Spec

utf16: string in little-endian UTF-16 encoding

How do I read this? I tried BinaryReader.ReadString(), but it returns something like:

"\0e\0y\0w\0o\0r\0d\0\0 \0\0\0\0\rP\0a\0r\0a\0m\0e\0t\0e\0r\0N\0a\0m\0e\0\0 \0\0\0\0\fC\0o\0n\0f\0i\0g\0S\0t\0r\0"

That definitely isn't right.


From the specification:

ubyte: number of UTF-16 characters (not bytes) of the name of the field
utf16: name of the field
ubyte: number of UTF-16 characters (not bytes) of the alias of the field. Might be 0
utf16: alias of the field (omitted if previous field is 0)
ubyte: field type ( 0 = int16, 1 = int32, 2 = float32, 3 = float64, 4 = string, 5 = datetime, 6 = objectid, 7 = geometry, 8 = binary, 9=raster, 10/11 = UUID, 12 = XML )

Could I somehow use the number of UTF-16 characters to read the name of the field?

Upvotes: 3

Views: 4489

Answers (2)

ulrichb

Reputation: 20054

BinaryReader's ReadString() method has no overload that lets you specify the string length. Instead, it expects the string to be preceded by its own length prefix (the byte count, written as a 7-bit encoded integer), which doesn't match the format in the spec you linked.
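
To illustrate the mismatch, here is a small sketch of what BinaryWriter.Write(string) produces, which is the layout ReadString() expects to find. The class and method names (PrefixDemo, WriteWithBinaryWriter) are made up for this example and are not part of .NET or the FGDB spec.

```csharp
using System.IO;
using System.Text;

public static class PrefixDemo
{
    // Hypothetical helper: returns the raw bytes BinaryWriter emits for a string
    // when it is configured with UTF-16 little-endian (Encoding.Unicode).
    public static byte[] WriteWithBinaryWriter(string text)
    {
        using var stream = new MemoryStream();
        using (var writer = new BinaryWriter(stream, Encoding.Unicode, leaveOpen: true))
        {
            // Writes a 7-bit encoded BYTE count first, then the encoded characters.
            writer.Write(text);
        }
        return stream.ToArray();
    }
}
```

For "Name" the prefix byte is 8 (the number of bytes), whereas the spec stores 4 (the number of UTF-16 characters) in a plain ubyte. That difference is why ReadString() misreads the field.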

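Therefore, you cannot use ReadString() directly, but you can:

  1. use ReadByte() to get the string length in characters,
  2. multiply it by 2 to get the length in bytes (each UTF-16 code unit is 2 bytes),
  3. use ReadBytes(count) to read that many bytes,
  4. use Encoding.Unicode.GetString(bytes) to decode them.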
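
The four steps above could be sketched roughly as follows. The names FgdbStrings and ReadCountedUtf16 are hypothetical, and the sketch assumes that "number of UTF-16 characters" in the spec means 2-byte code units.

```csharp
using System.IO;
using System.Text;

public static class FgdbStrings
{
    // Hypothetical helper: reads a ubyte character count followed by that many
    // UTF-16 little-endian code units, as described in the field layout.
    public static string ReadCountedUtf16(BinaryReader reader)
    {
        int charCount = reader.ReadByte();            // 1. length in characters
        int byteCount = charCount * 2;                // 2. two bytes per code unit (assumption)
        byte[] bytes = reader.ReadBytes(byteCount);   // 3. raw string bytes

        // ReadBytes returns fewer bytes than requested at end of stream.
        if (bytes.Length != byteCount)
            throw new EndOfStreamException("Stream ended inside a UTF-16 string.");

        return Encoding.Unicode.GetString(bytes);     // 4. decode UTF-16 LE
    }
}
```

Calling FgdbStrings.ReadCountedUtf16(br) once for the field name and once for the alias should work for both. When the alias count is 0, the helper reads no string bytes and returns an empty string, which matches the "omitted" case in the spec.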

Upvotes: 3

Damien_The_Unbeliever

Reputation: 239764

It should be:

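// Requires: using System.IO; using System.Text;
// Encoding.Unicode (UTF-16 little-endian) controls how the reader decodes characters.
BinaryReader br = new BinaryReader(File.Open("C:\\florida.gdb\\a00000002.gdbtable",
                                   FileMode.Open,
                                   FileAccess.Read,
                                   FileShare.Read | FileShare.Delete),
                      Encoding.Unicode);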

Where Encoding is System.Text.Encoding.


For various historical reasons, Microsoft/Windows use the name "Unicode" for UTF-16, and specifically for its little-endian variant. The big-endian variant is available as Encoding.BigEndianUnicode.
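
A quick way to check the byte order is to encode a single character and look at the bytes. EncodingDemo and Hex are hypothetical names used only for this sketch.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Hypothetical helper: encodes text and formats the bytes as hex, e.g. "41-00".
    public static string Hex(Encoding encoding, string text) =>
        BitConverter.ToString(encoding.GetBytes(text));
}
```

EncodingDemo.Hex(Encoding.Unicode, "A") gives "41-00" (low byte first), while EncodingDemo.Hex(Encoding.BigEndianUnicode, "A") gives "00-41".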

Upvotes: 1
