Ryan McDonough

Reputation: 10012

Byte to String - Obscure Characters

We store data as BLOBs in a database (ugh, I know). On my website I'm retrieving the data, putting it into a byte array, then converting it to a string to display. However, as you can see below, I'm getting weird characters in the text when viewing it in debug mode.

Hi John
�
I look forward to receipt of your instructions in due course.
�
Kind regards
�

When it renders, it looks like this:

Hi John�I look forward to receipt of your instructions in due course.�Kind regards�

Currently the code is:

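' Build an in-memory recordset with one binary field to hold the reassembled BLOB.
Dim RSFileNote As New ADODB.Recordset
RSFileNote.Fields.Append("FileNote", 205, intSizeofBlob) ' 205 = adLongVarBinary

RSFileNote.Open()
RSFileNote.AddNew()

' Append each stored segment to the field in order.
For n As Integer = 0 To dsVecSegment.Tables(0).Rows.Count - 1
    RSFileNote("FileNote").AppendChunk(dsVecSegment.Tables(0).Rows(n).Item("SDATA"))
Next
RSFileNote.Update()

' Decode the bytes as UTF-8; this is where the odd characters appear.
Dim vOut As String = System.Text.Encoding.UTF8.GetString(RSFileNote("FileNote").Value)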

I would have thought the UTF8 encoding would resolve this issue. Does anyone know what I can do to fix it on my side? (Getting the content in the database corrected isn't an option.)

Ideally, I want to remove the extraneous characters and replace the line breaks (which are in the .Value during debug) with line breaks that actually work.

Update

I think the issue lies in the fact that emails are copied and pasted into the initial input field to be stored in the database, so they carry artifacts over from Outlook into the field.

Update 2

Having taken Esailija's answer into account, the � characters are gone; however, the line breaks are still mysteriously going missing.

I would post the full output, but it contains private data. For emails that have been pasted in, though, the end of the text comes out as:

,wd-s.@ÓyøYð&¥¥ÀAàA•F•  €   p   IØ%Ð`ÐîèØMà!µì$ô#i!°p1¤ Ið-œ)) -„U€. x.y.)¨}U¹ M½!;¹4%;¨5˜6)˜2YA'8<1<8<9•=; !:$Ì78è#    Ùœ<ÐNÌ'Á',A yGÅC    ±]Õ 1 õH¥Ve„8¥9dN¹FMX   hX`Kè¸XÍ”U”dnÕU-€W@U`N%PDE 

Upvotes: 2

Views: 484

Answers (2)

Esailija

Reputation: 140230

The Unicode replacement character (�, U+FFFD) indicates an error when decoding a byte sequence: the bytes are not valid in the chosen encoding, in this case UTF-8. Any invalid UTF-8 sequence is replaced with the replacement character in the result. The character can also appear literally as a normal character, but that doesn't seem to be the case here.
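To illustrate the replacement behaviour, here is a minimal sketch. The sample bytes are made up for the demonstration (they are not taken from the question's data): "Hi" followed by a lone 0xA0 byte, which cannot stand on its own in UTF-8. The module name is hypothetical.

```vbnet
Imports System
Imports System.Text

Module ReplacementCharSketch
    Sub Main()
        ' Hypothetical sample: "Hi" followed by a lone 0xA0 byte,
        ' which is not a valid UTF-8 sequence by itself.
        Dim raw As Byte() = {&H48, &H69, &HA0}

        ' The default UTF-8 decoder substitutes U+FFFD for the invalid byte
        ' instead of throwing.
        Dim decoded As String = Encoding.UTF8.GetString(raw)

        Console.WriteLine(decoded.Length)                  ' 3
        Console.WriteLine(decoded(2) = ChrW(&HFFFD))       ' True
    End Sub
End Module
```

The original byte value is lost once it has been replaced, which is why the fix has to happen at decode time rather than on the resulting string.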

The reason is most likely that the encoding is not UTF-8. Without seeing the raw bytes, my best guess is that it's actually in CP1252.

So try this:

Dim enc As Encoding = Encoding.GetEncoding(1252)
Dim vOut As String = enc.GetString(RSFileNote("FileNote").Value)

Also, post a comment with what the result looks like in 1252, because the raw bytes can usually be deduced from that.
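If it is easier to look at the raw bytes directly, one option is to dump them as hex before decoding. This is only a sketch: the byte array below is hypothetical sample data standing in for the value read from RSFileNote("FileNote").Value, and the module name is made up.

```vbnet
Imports System

Module HexDumpSketch
    Sub Main()
        ' Hypothetical sample standing in for the BLOB contents:
        ' "Hi", a CR LF pair, then a single 0xA0 byte.
        Dim raw As Byte() = {&H48, &H69, &HD, &HA, &HA0}

        ' BitConverter.ToString renders each byte as two hex digits,
        ' separated by hyphens.
        Console.WriteLine(BitConverter.ToString(raw))   ' 48-69-0D-0A-A0
    End Sub
End Module
```

Bytes in the 0x80 to 0xFF range that appear singly, rather than in multi-byte groups, are a hint that the data is in a single-byte code page rather than UTF-8.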

Upvotes: 2

Phil Murray

Reputation: 6554

It's a nasty fix, but you could do this:

vOut = vOut.Replace("�", vbCrLf)
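If the string ends up in an HTML page, note that browsers collapse CR and LF into ordinary whitespace, so line breaks in the string will not show unless they are turned into markup. Whether that is the cause of the missing breaks in the question is an assumption. A minimal sketch, using a hypothetical helper named ToHtmlLineBreaks:

```vbnet
Imports System.Net

Module LineBreakSketch
    ' Hypothetical helper: HTML-encode the text first, then convert
    ' CRLF, lone CR, and lone LF into <br /> tags.
    Function ToHtmlLineBreaks(text As String) As String
        Dim encoded As String = WebUtility.HtmlEncode(text)
        Return encoded.Replace(vbCrLf, vbLf).Replace(vbCr, vbLf).Replace(vbLf, "<br />")
    End Function
End Module
```

Encoding before inserting the tags keeps any angle brackets in the pasted email text from being interpreted as markup.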

Upvotes: 2
