Ch33f

Reputation: 619

UTF-8 problems in writing a UART-Console on a microcontroller

I am currently writing a UART console on an ATmega1284P. It is supposed to echo the characters back so that the PC-side console actually shows what is being typed; that is all for now.

Here is the problem: with ASCII it works perfectly fine, but if I send anything beyond ASCII, e.g. a '§', minicom shows "�§". I would understand '�' alone (an invalid character) or '§' alone (everything works fine), but getting the combination of both throws me off, and I currently have no idea where the problem is!

Here is part of my code:

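    char c;
    // Keep going for as long as the UART driver hands over a received byte.
    while(m_uart->recv(c) > 0) {
        m_lineBuff[m_lineIndex++] = c;
        if(c == '\r') {
            // Follow a carriage return with a line feed (CR LF).
            c = '\n';
            m_lineBuff[m_lineIndex++] = c;
            m_sendCount = 2;
        } else {
            m_sendCount = 1;
        }
        // Echo m_sendCount byte(s) back to the PC.
        this->send();
        if(c == '\n') {
            // Terminate the line so it can be handled as a C string.
            m_lineBuff[m_lineIndex++] = '\0';
            // invoke some callbacks that handle the line at some point
            m_lineIndex = 0;
        }
    }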

m_lineBuff is a self-written (and tested) vector of chars. m_uart is a self-written (and also tested) UART driver for the micro's internal hardware UART. this->send sends m_sendCount bytes using m_uart.

What I tried so far: I verified that the baud rates of minicom and my micro match (115200). I verified that the frequency is within the 2% range (the micro is running at 20 MHz). Both minicom and the micro are set up for 8N1. I verified that minicom works by hooking it up to a little board I had lying around; on that board any UTF-8 character works just fine.

Does anyone see my mistake, or does anyone have a clue about what I haven't considered?
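For reference, the 2% figure can be checked with a small calculation. The sketch below assumes the USART runs in asynchronous normal mode (U2X = 0), where the ATmega datasheet gives UBRR = f_osc / (16 * baud) - 1. The function names are illustrative and not part of the driver from the question.

```cpp
#include <cstdint>

// Nearest UBRR register value for asynchronous normal mode (assumes U2X = 0).
constexpr std::uint32_t ubrr_for(std::uint32_t f_cpu, std::uint32_t baud) {
    return (f_cpu + 8u * baud) / (16u * baud) - 1u;
}

// Baud rate the hardware actually produces for a given UBRR value.
constexpr double actual_baud(std::uint32_t f_cpu, std::uint32_t ubrr) {
    return static_cast<double>(f_cpu) / (16.0 * (ubrr + 1u));
}

// Relative error in percent; negative means the micro is slower than requested.
constexpr double baud_error_percent(std::uint32_t f_cpu, std::uint32_t baud) {
    return (actual_baud(f_cpu, ubrr_for(f_cpu, baud)) / baud - 1.0) * 100.0;
}

// Values from the question: 20 MHz clock, 115200 baud.
static_assert(ubrr_for(20000000u, 115200u) == 10, "UBRR rounds to 10");
static_assert(baud_error_percent(20000000u, 115200u) > -2.0, "inside the 2% window");
```

With these inputs the error comes out at roughly -1.4%, so under the stated assumption the baud rate is indeed within the 2% range.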

I'll be happy to supply up to all of my code if you guys are interested in it.

EDIT/Elaboration:

Observation 1 (prior to starting this project)

The PC-side program (minicom) can send characters to and receive characters from the microcontroller. It does not show the sent characters, though.

Conclusion 1 (prior to starting this project)

The microcontroller side needs to send the characters back to the PC, so that you have the behaviour of a console. Thus I immediately send back any character I get.

Observation 2 (after implementing it)

When I press '§' (or any other character consisting of more than 1 byte) (using minicom) I see "�§".

Conclusion 2 (after implementing it)

Something I can't explain with my knowledge is going on. Maybe a small delay between the two bytes making up the character leads to minicom printing a '�' first, because the first byte on its own is indeed an invalid character, and when the second byte comes in, minicom realizes that it's actually '§' but doesn't remove/overwrite the '�'. If that is the problem, then how do I solve it? Does my microcontroller need to react faster/with less delay in between characters?
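One way to test this hypothesis would be to hold back the bytes of a multi-byte character and echo them back-to-back once the sequence is complete. The following is only a sketch: `Utf8Assembler` and its helper functions are hypothetical names, not part of the code above, and nothing in this question establishes that doing so changes what minicom displays.

```cpp
#include <cstdint>

// Length of a UTF-8 sequence judged from its first byte;
// 0 for a continuation byte or an invalid lead byte.
constexpr std::uint8_t utf8_sequence_length(unsigned char lead) {
    return (lead & 0x80) == 0x00 ? 1   // 0xxxxxxx: ASCII
         : (lead & 0xE0) == 0xC0 ? 2   // 110xxxxx
         : (lead & 0xF0) == 0xE0 ? 3   // 1110xxxx
         : (lead & 0xF8) == 0xF0 ? 4   // 11110xxx
         : 0;
}

// Continuation bytes have the form 10xxxxxx.
constexpr bool utf8_is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: collects bytes until one whole character is available.
class Utf8Assembler {
public:
    // Feed one received byte. Returns the number of bytes ready in data()
    // (1 to 4), or 0 while a multi-byte sequence is still incomplete.
    std::uint8_t push(unsigned char b) {
        if (m_need == 0 || !utf8_is_continuation(b)) {
            // New character; an unfinished previous sequence is discarded.
            m_have = 0;
            m_need = utf8_sequence_length(b);
            if (m_need == 0) m_need = 1;  // pass stray bytes through as-is
        }
        m_buf[m_have++] = b;
        if (m_have < m_need) return 0;
        const std::uint8_t ready = m_have;
        m_have = 0;
        m_need = 0;
        return ready;
    }

    const unsigned char* data() const { return m_buf; }

private:
    unsigned char m_buf[4] = {};
    std::uint8_t m_have = 0;  // bytes collected so far
    std::uint8_t m_need = 0;  // total bytes expected, 0 when idle
};
```

In the receive loop this would mean calling `push(c)` for every received byte and only echoing when it returns a non-zero count, so that `c2 a7` leaves the micro as one burst rather than as two separately echoed bytes.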

EDIT2:

I replaced the '?' with the actual character '�' using the power of copy and paste.

More tests I did

I tried the character '😹' and, as I expected (it backs my conclusion 2), I got "���😹". '😹', by the way, is a 4-byte character. I set the baud rate of the micro and minicom to 9600: exact same behaviour. I managed to put minicom into hex mode: it sends normally but outputs hex. When I send '😹' I get "f0 9f 98 b9", which (at least according to this site) is correct. Does that back my conclusion 2? And more importantly: how do I get rid of that behaviour? It works with my little Linux board instead of my micro.
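The hex dump can also be checked by hand. The sketch below decodes the four echoed bytes back into a code point; `utf8_decode` is an illustrative helper written for this check (it skips overlong-encoding and surrogate checks) and is not part of the project code.

```cpp
#include <cstddef>
#include <cstdint>

// Decodes one UTF-8 sequence of exactly `len` bytes into a code point.
// Returns 0xFFFFFFFF if the sequence is malformed.
constexpr std::uint32_t utf8_decode(const unsigned char* s, std::size_t len) {
    if (len == 0) return 0xFFFFFFFFu;
    std::uint32_t cp = 0;
    std::size_t expected = 0;
    if ((s[0] & 0x80) == 0x00)      { cp = s[0];        expected = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; expected = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; expected = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; expected = 4; }
    else return 0xFFFFFFFFu;          // continuation byte in lead position
    if (len != expected) return 0xFFFFFFFFu;
    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0xFFFFFFFFu;
        cp = (cp << 6) | (s[i] & 0x3F);  // append 6 payload bits per byte
    }
    return cp;
}

// Bytes shown by minicom's hex mode for the cat emoji, and the bytes for '§'.
constexpr unsigned char kCatBytes[] = {0xF0, 0x9F, 0x98, 0xB9};
constexpr unsigned char kSectionBytes[] = {0xC2, 0xA7};

static_assert(utf8_decode(kCatBytes, 4) == 0x1F639, "f0 9f 98 b9 is U+1F639");
```

Since the four bytes decode to U+1F639, the micro echoes a well-formed sequence. The same sequence cut short after its first byte is reported as malformed, which would fit a decoder that gives up on an incomplete character.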

Upvotes: 2

Views: 3175

Answers (1)

Patrick Trentin

Reputation: 7342

EDIT: the OP discovered on his own that the odd behaviour he observed is (probably) a bug in minicom itself. This post of mine clearly loses its value; unless the community thinks that it should be removed, I would leave it here as a record of possible workarounds when experiencing similar problems.


tl;dr: it appears that your PC application might not be interpreting UTF-8 correctly.


If we look at the Extended ASCII Code defined by ISO 8859-1,

A7 10100111 § § => Section sign

and according to this page, the UTF-8 encoding of § is

U+00A7 § c2 a7 => SECTION SIGN

So my educated guess is that the symbol is still printed correctly because it belongs to the Extended ASCII Code with the same value a7.

Either your end application fails to correctly interpret the UTF-8 lead byte (c2), and that's why you get a ? printed out, or a component in the middle fails to pass the correct value forward. I am inclined to believe your output is an instance of the first case.
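The byte relationship described above can be reproduced with a small encoder. This is a sketch with illustrative names (`Utf8Bytes`, `utf8_encode`): it shows that U+00A7 encodes to `c2 a7`, whose last byte happens to equal the ISO 8859-1 value, whereas a character outside ISO 8859-1 such as € (U+20AC) encodes to `e2 82 ac`.

```cpp
#include <cstdint>

// Result of encoding one code point: up to four bytes plus the used length.
struct Utf8Bytes {
    unsigned char b[4];
    std::uint8_t len;  // 0 means the code point cannot be encoded
};

// Encodes a Unicode code point as UTF-8.
constexpr Utf8Bytes utf8_encode(std::uint32_t cp) {
    Utf8Bytes out = {{0, 0, 0, 0}, 0};
    if (cp < 0x80) {
        out.b[0] = static_cast<unsigned char>(cp);
        out.len = 1;
    } else if (cp < 0x800) {
        out.b[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        out.b[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.len = 2;
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are invalid
        out.b[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        out.b[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out.b[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.len = 3;
    } else if (cp <= 0x10FFFF) {
        out.b[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        out.b[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        out.b[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out.b[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        out.len = 4;
    }
    return out;
}

// U+00A7 (section sign): two bytes, the second one equal to the Latin-1 byte a7.
static_assert(utf8_encode(0xA7).b[0] == 0xC2 && utf8_encode(0xA7).b[1] == 0xA7,
              "U+00A7 encodes to c2 a7");
```

A test character like € is useful because none of its UTF-8 bytes maps to the intended glyph in ISO 8859-1, so it cannot appear correct by coincidence.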


You claim that minicom works; I cannot refute this claim, but I would suggest you try the following things first:

  1. try sending a symbol that belongs to UTF-8 but not to the ISO 8859-1 standard: if it doesn't work, this should rule out your Conclusion #2 pretty much immediately;
  2. try reducing the speed to the lowest possible baud rate, 9600;
  3. verify, by checking the documentation, that minicom is correctly configured to interpret UTF-8 characters;
  4. try using some other application to fetch data from your microcontroller and see whether the results are consistent;
  5. verify that the Unicode symbol you're sending out is correct.

NB: this is kind of an incomplete answer, but I couldn't fit everything in the comments. If you're patient enough, please update your question with your findings and comment on this answer to notify me. I'll get back here and update my answer accordingly.

Upvotes: 3
