Jay Dee
Jay Dee

Reputation: 230

raspberry pi pico tcp client *** panic ***

I built a reasonably basic client server IOT setup in c++ with a raspberry pi 5 server and a few Raspberry pi pico-w clients. It's nothing special, just keeps track of some temperature and voltage readings for an off grid solar and hydropower system that powers a couple of houses.

I am struggling with the pico clients randomly crashing outputting *** PANIC *** now and then. If I could work out how catch and handle whatever the error is I could consider the system functional. Even a hard reboot and just start again would be better than as is having to pull the plug to get it working again.

This is the last few lines of serial output before the pico calls exit(1), I included 1 complete send/receive followed by a send/failedReceive but there is many identical successful send/receives

StartTasks...
task1: Connecting to access point JDnD
Connected to wifi with an IP address
MAC ADDRESS: D83ADD3E85F9
IP ADDRESS:  192.168.1.66
task2: Connecting to 192.168.1.1 port 8081 
task3: waiting for connection from server.
Connected to 192.168.1.1 port 8081 
task3: 
Send... 466
Type:TS;Module:D83ADD3E85F9;Device:2846BEA00600004C;Value:22.3C;IP:192.168.1.66;
Type:TS;Module:D83ADD3E85F9;Device:2835BAA006000092;Value:22.7C;IP:192.168.1.66;
Type:TS;Module:D83ADD3E85F9;Device:2873B6A006000074;Value:22.2C;IP:192.168.1.66;
Type:RL;Module:D83ADD3E85F9;Device:Relay1;Value:ON;IP:192.168.1.66;Enable:FALSE;Action:;
Type:RL;Module:D83ADD3E85F9;Device:Relay2;Value:ON;IP:192.168.1.66;Enable:;Action:;
Type:TM;Module:D83ADD3E85F9;Device:Timer1;Value:;
task4: waiting for response from server......
tcp_client_sent: len=466 
tcp_client_receive: len=357 err=0
Responce...
  Type;TS;Module;;Device;2846BEA00600004C;Value;22.3C;1;
  Type;TS;Module;;Device;2835BAA006000092;Value;22.3C;1;
  Type;TS;Module;;Device;2873B6A006000074;Value;22.3C;1;
  Type;RL;Module;;Device;Relay1;Value;ON;2;Group;Huntsman;Name;Fire1R;Enable;FALSE;2;Action;2873B6A006000074>25;
    Value:ON:2;
    Enable:FALSE:2;
  Type;RL;Module;;Device;Relay2;Value;OFF;1;Group;Huntsman;Name;Heater;Enable;FALSE;Action;;
  Type;TM;Module;;Device;Timer1;Value;;
task4: Close Client 
StartTasks...
task1: Connecting to access point JDnD
Connected to wifi with an IP address
MAC ADDRESS: D83ADD3E85F9
IP ADDRESS:  192.168.1.66
task2: Connecting to 192.168.1.1 port 8081 
task3: waiting for connection from server.
Connected to 192.168.1.1 port 8081 
task3: 
Send... 466
Type:TS;Module:D83ADD3E85F9;Device:2846BEA00600004C;Value:22.3C;IP:192.168.1.66;
Type:TS;Module:D83ADD3E85F9;Device:2835BAA006000092;Value:22.7C;IP:192.168.1.66;
Type:TS;Module:D83ADD3E85F9;Device:2873B6A006000074;Value:22.2C;IP:192.168.1.66;
Type:RL;Module:D83ADD3E85F9;Device:Relay1;Value:ON;IP:192.168.1.66;Enable:FALSE;Action:;
Type:RL;Module:D83ADD3E85F9;Device:Relay2;Value:ON;IP:192.168.1.66;Enable:;Action:;
Type:TM;Module:D83ADD3E85F9;Device:Timer1;Value:;
task4: waiting for responce from server.
*** PANIC ***

tcp_slowtmr: TIME-WAIT pcb->state == TIME-W

I grep searched the pico sdk for PANIC and found this function in 'rp2_common/pico_platform_panic/panic.c' and 'pico_platform/platform_base.c' that takes a variable number of args and outputs them to stdout

void panic(const char *fmt, ...)
{
    va_list args;
    puts("*** PANIC ***\n");
    if (fmt)
    {
        va_start(args, fmt);
        vprintf(fmt, args);
        va_end(args);
    }
    puts("\n");
    __breakpoint();
}

after '*** PANIC ***' I have a blank line, followed by 'tcp_slowtmr: TIME-WAIT pcb->state == TIME-W'

I grep search the sdk for calls to the panic function with no real success, I notice a few calls 'panic("");' and 'panic(NULL);' that could be relevant since there is a blank line output but nothing stands out pointing me in any real direction.

pico-sdk/lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:108:    panic("ep %02X was already available", ep->ep_addr);
pico-sdk/lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:307:    panic("Can't continue xfer on inactive ep %02X", ep->ep_addr);
pico-sdk/lib/tinyusb/src/portable/raspberrypi/rp2040/hcd_rp2040.c:163:    panic("Unhandled buffer %d\n", remaining_buffers);
pico-sdk/lib/tinyusb/src/portable/raspberrypi/rp2040/hcd_rp2040.c:246:    panic("Data Seq Error \n");
pico-sdk/lib/tinyusb/src/portable/raspberrypi/rp2040/hcd_rp2040.c:251:    panic("Unhandled IRQ 0x%x\n", (uint) (status ^ handled));
pico-sdk/lib/tinyusb/src/portable/raspberrypi/rp2040/hcd_rp2040.c:456:    panic("Invalid speed\n");
pico-sdk/lib/tinyusb/src/portable/raspberrypi/rp2040/hcd_rp2040.c:641:    panic("hcd_clear_stall");
pico-sdk/lib/tinyusb/src/portable/raspberrypi/rp2040/dcd_rp2040.c:352:    panic("Unhandled IRQ 0x%x\n", (uint) (status ^ handled));
pico-sdk/src/rp2040/pico_platform/include/pico/platform.h:25:#include "pico/platform/panic.h"
pico-sdk/src/host/hardware_sync/sync_core0_only.c:98:                     panic("Can't wait on irq for host core0 only implementation");
pico-sdk/src/host/pico_runtime/runtime.c:10:                              panic("Hard assert");
pico-sdk/src/host/pico_platform/platform_base.c:20:void panic(const char *fmt, ...)
pico-sdk/src/host/pico_platform/include/pico/platform.h:123:void __noreturn panic(const char *fmt, ...);
pico-sdk/src/rp2_common/pico_platform_panic/panic.c:22:                   panic("not supported");
pico-sdk/src/rp2_common/pico_platform_panic/panic.c:34:void __attribute__((naked, noreturn)) __printflike(1, 0) panic(__unused const char *fmt, ...)
pico-sdk/src/rp2_common/pico_platform_panic/panic.c:65:void __attribute__((noreturn)) __printflike(1, 0) panic(const char* fmt, ...)
pico-sdk/src/rp2_common/pico_platform_panic/include/pico/platform/panic.h:31:void __attribute__((noreturn)) panic(const char *fmt, ...);
pico-sdk/src/rp2_common/pico_runtime/runtime.c:12:                        panic("Hard assert");
pico-sdk/src/rp2_common/pico_lwip/include/arch/cc.h:89:#define LWIP_PLATFORM_ASSERT(x) panic(x)
pico-sdk/src/rp2_common/pico_cyw43_driver/cybt_shared_bus/cybt_shared_bus.c:275:        panic("cyw43 buffer overflow");
pico-sdk/src/rp2_common/pico_cyw43_driver/cybt_shared_bus/cybt_shared_bus_driver.c:414: panic("cyw43 btsdio register corruption");
pico-sdk/src/rp2_common/pico_cyw43_driver/cyw43_driver.c:185:             panic("cyw43 has no ethernet interface");
pico-sdk/src/rp2_common/pico_malloc/malloc.c:30:                          panic("Out of memory");
pico-sdk/src/rp2_common/pico_double/double_init_rom_rp2040.c:17:          panic("missing double function");
pico-sdk/src/rp2_common/pico_double/double_init_rom_rp2040.c:61:          panic(NULL);
pico-sdk/src/rp2_common/pico_double/double_none.S:77:    j  panic
pico-sdk/src/rp2_common/pico_double/double_none.S:81:    bl panic
pico-sdk/src/rp2_common/pico_float/float_none.S:75:    j  panic
pico-sdk/src/rp2_common/pico_float/float_none.S:79:    bl panic
pico-sdk/src/rp2_common/pico_float/float_init_rom_rp2040.c:18:           panic("");
pico-sdk/src/rp2_common/pico_float/float_init_rom_rp2040.c:38:           panic("");
pico-sdk/src/rp2_common/pico_printf/printf_none.S:22:    bl panic
pico-sdk/src/rp2_common/pico_printf/printf_none.S:25:    call panic
pico-sdk/src/rp2_common/pico_async_context/async_context_poll.c:50:      panic("async_context_poll context check failed (IRQ or wrong core)");
pico-sdk/src/rp2_common/pico_multicore/multicore.c:335:                  panic( "Multicoore doorbell %d already claimed on core mask 0x%x; requested core mask 0x%x\n",
pico-sdk/src/rp2_common/pico_multicore/multicore.c:366:                  panic("No free doorbells");
pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c:62:                 panic("UART baud rate undefined");
pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c:73:                 panic("UART baud rate undefined");
pico-sdk/src/rp2_common/pico_stdio_uart/stdio_uart.c:85:                 panic("UART baud rate undefined");
pico-sdk/src/rp2_common/hardware_clocks/clocks.c:202:                    panic("Unexpected clocks irq\n");
pico-sdk/src/rp2_common/hardware_clocks/include/hardware/clocks.h:442:   panic("System clock of %u Hz cannot be exactly achieved", freq_hz);
pico-sdk/src/rp2_common/hardware_clocks/include/hardware/clocks.h:464:   panic("System clock of %u kHz cannot be exactly achieved", freq_khz);
pico-sdk/src/rp2350/pico_platform/include/pico/platform.h:25:#include "pico/platform/panic.h"
pico-sdk/src/common/pico_time/time.c:387:                                panic("Attempted to sleep inside of an exception handler; use busy_wait if you must");
pico-sdk/src/common/hardware_claim/claim.c:24:                           panic(message, bit_index);
pico-sdk/src/common/hardware_claim/claim.c:44:                           panic(message);

noticing the panic is followed by 'tcp_slowtmr: TIME-WAIT pcb->state == TIME-W' I have found these lines that look like they could be relevant,

pico-sdk/lib/lwip/src/core/tcp.c:1445:                           LWIP_ASSERT("tcp_slowtmr: TIME-WAIT pcb->state == TIME-WAIT", pcb->state == TIME_WAIT);
pico-sdk/lib/btstack/3rd-party/lwip/core/src/core/tcp.c:1445:    LWIP_ASSERT("tcp_slowtmr: TIME-WAIT pcb->state == TIME-WAIT", pcb->state == TIME_WAIT);
/**
 * Called every 500 ms and implements the retransmission timer and the timer that
 * removes PCBs that have been in TIME-WAIT for enough time. It also increments
 * various timers such as the inactivity timer in each PCB.
 *
 * Automatically called from tcp_tmr().
 */
void
tcp_slowtmr(void);
/* Steps through all of the TIME-WAIT PCBs. */
  prev = NULL;
  pcb = tcp_tw_pcbs;
  while (pcb != NULL) {
    LWIP_ASSERT("tcp_slowtmr: TIME-WAIT pcb->state == TIME-WAIT", pcb->state == TIME_WAIT);
    pcb_remove = 0;

    /* Check if this PCB has stayed long enough in TIME-WAIT */
    if ((u32_t)(tcp_ticks - pcb->tmr) > 2 * TCP_MSL / TCP_SLOW_INTERVAL) {
      ++pcb_remove;
    }

    /* If the PCB should be removed, do it. */
    if (pcb_remove) {
      struct tcp_pcb *pcb2;
      tcp_pcb_purge(pcb);
      /* Remove PCB from tcp_tw_pcbs list. */
      if (prev != NULL) {
        LWIP_ASSERT("tcp_slowtmr: middle tcp != tcp_tw_pcbs", pcb != tcp_tw_pcbs);
        prev->next = pcb->next;
      } else {
        /* This PCB was the first. */
        LWIP_ASSERT("tcp_slowtmr: first pcb == tcp_tw_pcbs", tcp_tw_pcbs == pcb);
        tcp_tw_pcbs = pcb->next;
      }
      pcb2 = pcb;
      pcb = pcb->next;
      tcp_free(pcb2);
    } else {
      prev = pcb;
      pcb = pcb->next;
    }
  }

I cannot work out how to tie this background process timing out to were panic is called, or how I am supposed to handle the timeout error.

I expect the answer is simple but my question is, how am I supposed to catch this call to panic and handle it rather than letting the background task crash the module and exit?

My client main loop is single threaded pico_cyw43_arch_poll

int loop()
{
    cyw43_arch_poll();
    cyw43_arch_wait_for_work_until(make_timeout_time_ms(1));
    int err = task.Next();
    if (err) { JDND::Terminal::Error("ERROR: task.Next err=" + JDND::itoa(err)); }
    // run timer tasks
    if (startTasks)
    {
        JDND::Terminal::Debug("StartTasks...");
        startTasks = false;
        timer_start(10);
        if (!task.IsActive())
        {
            int err = task.Start(task1);
            if (err) { JDND::Terminal::Error("ERROR: task.Start " + JDND::itoa(err)); }
        }
    }
}

my server is also single threaded and only handles one client at a time but there is only a few modules and even with only 1 module running the problem still exists.

I'm thinking of making a backup copy of the sdk so I can put it back as is, then adding an extra line just before each call to panic() outputting a number or something, so I can see which function is actually firing it. do you think this is viable as a test?

Upvotes: 0

Views: 257

Answers (1)

Jay Dee
Jay Dee

Reputation: 230

Finally fixed it.

missing null terminator from the end of the response from the server back to the peco.

std::string result = JDND::Strings::Join(lines, '\n');
status = write(sock_client, result.c_str(), result.length());

notice the plus 1

std::string result = JDND::Strings::Join(lines, '\n');
status = write(sock_client, result.c_str(), result.length()+1);

the peco sends a packet 'DATA\nDATA\nDATA\0' and received back 'DATA\nDATA\nDATA\nUNDEFINED\0'. iterated over the response checking each DATA then doped out, ignoring the last UNDEFINED line gone unnoticed.

most of the time it did not cause any bother but I guess depending on how far away the next undefined \0 happen to be determined if it would fit in the allocated buffer on the peco or not.

UPDATE:

I have since realized that well sending the null terminator in the packet fixed the problem, it was still stupid as that makes the clients rely on the server to do so. adding the null terminator to the end of the the received packet on the pico client upon receiving data is the real fix

static void client_receive(void* arg, struct udp_pcb* pcb, struct pbuf* p, const ip_addr_t* addr, u16_t port)
{
    if (p == nullptr) { return; }
    cyw43_arch_lwip_check();
    
    char* data = (char*)p->payload; // ADDED THIS LINE
    data[p->tot_len] = '\0'; // ADDED THIS LINE
    
    client_received = std::string((const char*)p->payload);
    pbuf_free(p);
}

Upvotes: 1

Related Questions