Why Most AMRs Fail in Production (And How We Fixed Ours)

I’ve spent the last year building devibot, Peribott’s autonomous mobile robot (AMR), from the ground up. Every layer: STM32 firmware, FreeRTOS scheduling, CANopen motor control, ROS 2 navigation, dashboard, cloud platform. And I’ve broken every layer at least twice, usually at 2am, usually with a deadline coming.

This article is what I wish someone had told me before I started. Not the theoretical version of AMR development. The production version — where the failures are specific, expensive, and completely predictable in hindsight.

The fundamental mistake: demo-first thinking

Most AMR projects start the same way. You get the robot moving. You get SLAM working. You demo it navigating between two points in a clean environment. It works. Everyone is happy. You ship it to the customer.

Then it fails. Not spectacularly — just quietly, in ways that are hard to debug remotely and expensive to fix on-site.

The gap between “it works in the lab” and “it works in production” is where most AMR projects die. Here are the specific failure modes I encountered, and how we fixed them in devibot.

Failure 1: The NaN that crashed the dashboard

This one took us three days to find. The devibot dashboard was dropping its WebSocket connection intermittently — sometimes after 2 hours, sometimes after 20 minutes. No pattern. No obvious error.

The root cause: an ultrasonic sensor returning NaN when the robot was in a corner with no clear echo. That NaN flowed through our FastAPI backend and was serialised into the payload as a bare NaN token, which Python’s json module emits by default. JSON has no NaN literal, so the payload was not valid JSON. The client could not parse the message and dropped the connection.

The fix: a sanitisation layer before every JSON serialisation call. All sensor values are checked, and NaN, Inf, and -Inf are replaced with null or a safe default. One function, called everywhere. We haven’t had an unexplained WebSocket drop since.

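import math

def sanitise_float(val, default=0.0):
    # Replace missing or non-finite readings (NaN, Inf, -Inf) with a safe
    # default; pass default=None to get a JSON null instead of 0.0.
    if val is None or not math.isfinite(val):
        return default
    return round(val, 4)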
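
The same idea extends to whole payloads rather than single values. What follows is a minimal sketch, assuming telemetry is built from plain dicts, lists, and floats. The names sanitise_payload and to_json and the telemetry fields are hypothetical, not devibot’s actual code. It also passes allow_nan=False to json.dumps, so any non-finite value that slips past the sanitiser raises ValueError on the server instead of reaching the client.

```python
import json
import math

def sanitise_payload(obj, default=None):
    """Recursively replace NaN/Inf floats inside dicts, lists and tuples."""
    if isinstance(obj, float):
        return obj if math.isfinite(obj) else default
    if isinstance(obj, dict):
        return {key: sanitise_payload(value, default) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [sanitise_payload(item, default) for item in obj]
    return obj  # ints, strings, bools and None pass through unchanged

def to_json(payload):
    # allow_nan=False makes json.dumps raise ValueError instead of emitting
    # the bare token NaN, so a missed value fails loudly on the server.
    return json.dumps(sanitise_payload(payload), allow_nan=False)

# Hypothetical telemetry payload with one bad ultrasonic reading.
telemetry = {"ultrasonic_m": [0.42, float("nan")], "battery_v": 25.1}
encoded = to_json(telemetry)
```

Defaulting to None (JSON null) rather than 0.0 is a design choice: a dashboard can show "no reading" instead of a distance of zero that looks plausible.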

Failure 2: The mutex inside the ISR

This one was nastier. The STM32 was crashing occasionally — HardFault, no obvious cause. It happened maybe once every 4 hours of operation. Completely unreproducible in testing.

The cause: we had a debug logging call inside HAL_FDCAN_RxFifo0Callback(). That callback runs in interrupt context. The debug call was trying to acquire a FreeRTOS mutex to protect a shared log buffer. Calling xSemaphoreTake() from an ISR is undefined behaviour in FreeRTOS. Sometimes it worked. Sometimes it corrupted the scheduler state. Sometimes it caused a HardFault.

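The rule in FreeRTOS is absolute: never call blocking API functions from an ISR. Only the API functions whose names end in FromISR are safe in interrupt context, and they never block. Mutexes cannot be used from an ISR at all, so anything that needs one has to be deferred to a task.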

// WRONG: blocks inside ISR
void HAL_FDCAN_RxFifo0Callback(FDCAN_HandleTypeDef *hfdcan, uint32_t RxFifo0ITs) {
    xSemaphoreTake(log_mutex, portMAX_DELAY); // CRASH
    log_message("CAN received");
    xSemaphoreGive(log_mutex);
}

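// RIGHT: read the frame, post it to a queue, handle it (and log) in a task
void HAL_FDCAN_RxFifo0Callback(FDCAN_HandleTypeDef *hfdcan, uint32_t RxFifo0ITs) {
    BaseType_t xHigherPriorityTaskWoken = pdFALSE;
    can_frame_t frame; // illustrative struct: FDCAN_RxHeaderTypeDef header; uint8_t data[64];
    if ((RxFifo0ITs & FDCAN_IT_RX_FIFO0_NEW_MESSAGE) &&
        HAL_FDCAN_GetRxMessage(hfdcan, FDCAN_RX_FIFO0, &frame.header, frame.data) == HAL_OK) {
        // Never blocks: if the queue is full the frame is dropped
        xQueueSendFromISR(can_rx_queue, &frame, &xHigherPriorityTaskWoken);
    }
    portYIELD_FROM_ISR(xHigherPriorityTaskWoken);
}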

Failure 3: The boot sequence that left robots half-alive

Early devibot deployments had a painful startup ritual: SSH in, source the workspace, run ros2 launch, open a browser, point it at the right IP. Miss any step and the robot was in a broken state — some nodes running, some not, no clear indication of what had failed.

For a single development robot, this is annoying. For a fleet of 10 robots in a warehouse, it’s operationally unacceptable.

We built the AMR Boot System to solve this. Six orchestrated phases that run automatically from power-on. Each phase verifies the previous one before proceeding. If anything fails, the robot enters safe mode automatically — starting a diagnostics dashboard that tells the operator exactly what went wrong, through the touchscreen, without SSH.

Boot to fully operational dashboard: under 90 seconds. Zero manual steps. This is now standard in every devibot deployment.
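
The verify-then-proceed structure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual AMR Boot System: the Phase class, run_boot, the phase names, and the safe-mode callback are all hypothetical, and only three of the six phases are mocked.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Phase:
    name: str
    start: Callable[[], None]   # brings the phase up
    verify: Callable[[], bool]  # True only if the phase is actually healthy

def run_boot(phases: List[Phase],
             enter_safe_mode: Callable[[str, str], None]) -> Tuple[bool, List[str]]:
    """Run phases in order; stop and enter safe mode at the first failure."""
    completed: List[str] = []
    for phase in phases:
        try:
            phase.start()
            healthy = phase.verify()
            reason = "verification failed"
        except Exception as exc:  # a crashing phase counts as a failed one
            healthy = False
            reason = f"{type(exc).__name__}: {exc}"
        if not healthy:
            # Report which phase failed and why, then stop the sequence.
            enter_safe_mode(phase.name, reason)
            return False, completed
        completed.append(phase.name)
    return True, completed

# Usage with mocked phases: the second one fails its health check.
safe_mode_reports = []
phases = [
    Phase("can_bus", lambda: None, lambda: True),
    Phase("motor_drives", lambda: None, lambda: False),  # simulated failure
    Phase("navigation", lambda: None, lambda: True),     # never reached
]
ok, completed = run_boot(
    phases, lambda name, reason: safe_mode_reports.append((name, reason))
)
```

The point of the structure is that no later phase starts on top of an unverified one, and the failure report names the phase, which is what lets an operator act on it from the touchscreen.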

Failure 4: CAN frame format mismatch on the BMS

The Pylontech BMS we integrated uses 29-bit extended CAN frames. Our STM32 FDCAN global filter was configured for 11-bit standard frames. Result: the BMS was transmitting perfectly; the STM32 was ignoring every single message. The bus looked healthy on a CAN sniffer. The firmware was just silently discarding everything.

// Wrong: only accepts standard 11-bit frames
FDCAN_FilterTypeDef filter = {
    .IdType = FDCAN_STANDARD_ID,  // BMS uses EXTENDED_ID
    ...
};

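// Correct: accept extended 29-bit frames
// (the FDCAN init must set ExtFiltersNbr >= 1, and the filter only takes
// effect once it is passed to HAL_FDCAN_ConfigFilter())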
FDCAN_FilterTypeDef filter = {
    .IdType = FDCAN_EXTENDED_ID,
    .FilterIndex = 0,
    .FilterType = FDCAN_FILTER_RANGE,
    .FilterConfig = FDCAN_FILTER_TO_RXFIFO0,
    .FilterID1 = 0x00000000,
    .FilterID2 = 0x1FFFFFFF,
};
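
Why the standard-ID filter discarded everything is easier to see in a toy model. The sketch below is plain Python, not HAL code: RangeFilter and the frame ID are hypothetical, and it assumes frames that match no filter are rejected. A filter compares the identifier type first, so an extended frame never matches a standard filter, even when its numeric ID would fit in 11 bits.

```python
from dataclasses import dataclass

STANDARD, EXTENDED = "standard", "extended"

@dataclass(frozen=True)
class RangeFilter:
    id_type: str
    id_low: int
    id_high: int

    def accepts(self, frame_id_type: str, frame_id: int) -> bool:
        # The identifier type must match before the ID range is considered.
        return frame_id_type == self.id_type and self.id_low <= frame_id <= self.id_high

std_filter = RangeFilter(STANDARD, 0x000, 0x7FF)            # all 11-bit IDs
ext_filter = RangeFilter(EXTENDED, 0x00000000, 0x1FFFFFFF)  # all 29-bit IDs

bms_frame = (EXTENDED, 0x12345678)  # arbitrary illustrative ID, not a real BMS ID

dropped_by_std = not std_filter.accepts(*bms_frame)              # True
accepted_by_ext = ext_filter.accepts(*bms_frame)                 # True
small_id_still_dropped = not std_filter.accepts(EXTENDED, 0x100) # True
```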

The pattern behind all of these failures

Every failure above had the same underlying cause: an assumption that was true in development but false in production. The NaN assumption: sensors always return valid floats. The ISR assumption: it’s fine to use standard RTOS calls in interrupt context. The boot assumption: someone technical will always be present to start the robot. The CAN filter assumption: all devices use the same frame format.

Production systems eliminate assumptions. They handle every edge case, not just the common path. They fail safely and visibly, not silently and mysteriously.

Building devibot to this standard took longer than building the prototype. But it’s the only version worth deploying.

Amit Jagnani is the Founder & CTO of Peribott Dynamic LLP, building the devibot AMR platform in Hyderabad, India. If you’re deploying autonomous robots and want to talk through your architecture, reach out directly.