Edge AI on a Power Budget: Running ML Models on Embedded Linux for Sensor Fusion


The edge AI revolution is not about putting GPT on a Raspberry Pi. It is about running the right model, on the right hardware, within a strict power budget — and getting reliable results in the field, not just in the lab.

If you are building a battery-powered or thermally constrained product that needs on-device intelligence, this is the playbook we wish we had when we started.


The Project: Radar-Based Health Monitoring for Elder Care

Everything in this article is grounded in a real product we architected and deployed: a contactless health-monitoring system for elderly care. The device sits in a room — a bedroom, a bathroom, a living area — and uses a 60-GHz mmWave radar to detect falls, monitor vital signs (respiration rate, heart rate), and track activity patterns, all without cameras or wearables.

The system fuses data from three sensor modalities — mmWave radar point clouds, an IMU (accelerometer + gyroscope), and environmental sensors (temperature, humidity, barometric pressure) — and runs a convolutional neural network on-device to classify events in real time. The target deployment is homes and elder-care facilities across India, where the device must:

  • Respond within 500 ms of a fall event — cloud round-trips are too slow and unreliable
  • Preserve privacy — no cameras, no audio, and raw sensor data never leaves the device
  • Run 24/7 on under 2 watts — silent, fanless, suitable for a bedroom
  • Work offline — Wi-Fi in Indian homes and care facilities is not guaranteed

These constraints — latency, privacy, power, and connectivity — made edge inference the only viable architecture. Every technical decision in this article flows from this context.


Edge vs Cloud Processing Trade-offs

Why Edge, Not Cloud?

Before diving into implementation, it is worth understanding why edge inference matters for sensor-heavy products:

  • Latency: A cloud round-trip adds 50–300 ms of unpredictable latency. For fall detection, that delay can mean the difference between a timely alert and a missed event. Our target was sub-200 ms end-to-end — sensor input to classification output.
  • Privacy: Health monitoring data — radar point clouds, heart-rate variability, respiration patterns — should never leave the device unless absolutely necessary. Edge inference keeps raw data local.
  • Connectivity dependence: Our system is deployed in homes and elder-care facilities where Wi-Fi can be unreliable. The system must function fully offline.
  • Recurring cost: Cloud inference at scale is expensive. A device that processes everything locally has a one-time BOM cost, not a monthly API bill.

Selecting the Right Hardware

The hardware decision is the single most consequential choice for edge AI. It determines your power envelope, your model complexity ceiling, and your software stack.

ARM Cortex-A vs Cortex-M

| Feature | Cortex-A (e.g., A53, A72) | Cortex-M (e.g., M4, M7, M33) |
|---|---|---|
| OS | Full Linux (Yocto, Buildroot) | Bare-metal or RTOS (FreeRTOS, Zephyr) |
| Clock | 1.0–2.0 GHz | 100–600 MHz |
| RAM | 512 MB–4 GB | 256 KB–2 MB |
| Power | 0.5–3 W typical | 10–200 mW typical |
| ML sweet spot | CNNs, RNNs, transformer distillations | Tiny classifiers, keyword spotting, anomaly detection |

For our health-monitoring product, we chose a Cortex-A53 quad-core (NXP i.MX 8M Mini) running Yocto Linux. The sensor fusion pipeline — fusing mmWave radar point clouds with IMU data and environmental sensors — demanded the memory and processing headroom that only a Cortex-A class processor could provide. A Cortex-M would have been sufficient for a single-sensor keyword-spotting application, but not for multi-modal fusion with a CNN backbone.

Neural Processing Units and DSPs

Many modern SoCs include dedicated NPUs or DSPs that can accelerate ML inference at a fraction of the power cost of running the same workload on the main CPU. The i.MX 8M Plus, for instance, includes a 2.3 TOPS NPU. However, NPU support in ML frameworks can be uneven — operator coverage, quantisation compatibility, and driver maturity vary widely. We evaluated the NPU path but ultimately chose CPU-only inference with aggressive model optimisation, as it provided more predictable behaviour and simpler debugging during development. The NPU remains a future optimisation lever.

ML Pipeline: From Training to Edge Deployment

Model Optimisation: Fitting Intelligence into Milliwatts

A stock ResNet-50 requires ~4 GFLOPs per inference. That is far too expensive for a battery-powered device. The goal is to reduce compute and memory requirements by 10–50x while preserving accuracy within acceptable bounds. Here is what actually works:

1. Quantisation

Quantisation converts model weights and activations from 32-bit floating point to lower-precision representations, typically INT8. This delivers roughly 4x memory reduction and 2–4x inference speedup on ARM NEON, with minimal accuracy loss for most sensor fusion tasks.

  • Post-Training Quantisation (PTQ): Apply quantisation after training using a small calibration dataset. Fast to implement, but can degrade accuracy on models with wide dynamic range.
  • Quantisation-Aware Training (QAT): Insert fake quantisation nodes during training so the model learns to be robust to reduced precision. More effort, but consistently better results.

Our results: QAT on our fall-detection CNN reduced model size from 12 MB to 3.1 MB and inference time from 420 ms to 95 ms on the A53, with less than 0.8% accuracy drop on our validation set.
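To make the quantisation step concrete, here is a minimal post-training quantisation sketch with the TensorFlow Lite converter — the same converter consumes a QAT-trained model. The model and calibration data below are placeholders, not our production network or dataset.

```python
import numpy as np
import tensorflow as tf

# Placeholder model and calibration data -- substitute the trained
# fall-detection CNN and real recorded sensor windows.
keras_model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),
])
calibration_windows = np.random.rand(100, 64, 64, 3).astype(np.float32)

def representative_data_gen():
    # A small calibration set lets the converter pick quantisation ranges
    for window in calibration_windows:
        yield [window[np.newaxis, ...]]

converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force full-integer quantisation so every op runs in INT8 on device
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("fall_detector_int8.tflite", "wb") as f:
    f.write(converter.convert())
```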

2. Pruning

Pruning removes weights or entire filters that contribute least to the model's output. Structured pruning (removing entire channels or filters) is preferred for embedded targets because it produces genuinely smaller, faster models without requiring sparse-matrix support. Unstructured pruning (zeroing individual weights) can achieve higher compression ratios on paper but rarely translates to real speedup on ARM hardware without specialised sparse inference engines.
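As an illustration of the filter-ranking idea behind structured pruning — the layer and the 30% ratio below are placeholders, and a real pipeline would rebuild the network around the surviving filters and fine-tune afterwards:

```python
import numpy as np
import tensorflow as tf

# Placeholder layer standing in for one convolution from the real network.
conv = tf.keras.layers.Conv2D(32, 3)
conv.build((None, 64, 64, 16))          # kernel shape: (3, 3, 16, 32)

# Rank output filters by the L1 norm of their weights; the smallest-norm
# filters are the candidates for structured removal.
kernel = conv.get_weights()[0]
l1_per_filter = np.abs(kernel).reshape(-1, kernel.shape[-1]).sum(axis=0)
keep = np.argsort(l1_per_filter)[int(0.3 * len(l1_per_filter)):]   # drop ~30%

print(f"Keeping {len(keep)} of {kernel.shape[-1]} filters")
# A smaller Conv2D(len(keep), 3) is then rebuilt with kernel[..., keep],
# the next layer's input channels are sliced to match, and the network is
# fine-tuned to recover accuracy.
```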

3. Knowledge Distillation and Architecture Search

Rather than compressing a large model, another approach is to train a small model from scratch, guided by a larger "teacher" model. We used knowledge distillation to train a lightweight MobileNet-v3-based student model that matched 97% of our original CNN's accuracy at one-fifth the compute. Neural Architecture Search (NAS) tools like Once-for-All can automate this process, but in practice, manual architecture design guided by profiling data was faster for our team.
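For reference, a distillation loss of the standard shape looks like the sketch below; the temperature and blend weight are illustrative values, not our tuned hyperparameters.

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-target distillation."""
    # Soft targets: cross-entropy against the teacher's softened
    # distribution (equivalent to KL divergence up to a constant).
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    log_soft_student = tf.nn.log_softmax(student_logits / temperature)
    kd = tf.reduce_mean(-tf.reduce_sum(soft_teacher * log_soft_student, axis=-1))
    kd = kd * temperature ** 2          # standard temperature scaling

    # Hard targets: ordinary cross-entropy against the true labels.
    ce = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True))

    return alpha * kd + (1.0 - alpha) * ce
```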

Runtime Selection

| Runtime | Strengths | Considerations |
|---|---|---|
| TensorFlow Lite | Mature, wide operator coverage, XNNPACK delegate for ARM | Larger binary footprint (~2 MB), Google ecosystem lock-in |
| ONNX Runtime | Framework-agnostic, growing ARM optimisation | Less mature on embedded Linux, fewer community examples |

We chose TensorFlow Lite with the XNNPACK delegate for this project. The XNNPACK delegate leverages ARM NEON SIMD instructions and delivers consistent performance improvements of 2–3x over the default interpreter on Cortex-A53.
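A minimal sketch of the inference call with the Python tf.lite API, assuming the INT8 artifact from the quantisation step; on-target deployments typically use the C/C++ interpreter, but the flow is identical. Recent TensorFlow Lite builds apply XNNPACK by default for supported ops, while a custom embedded build may need it enabled at compile time.

```python
import numpy as np
import tensorflow as tf

# Thread count is worth tuning per SoC; four threads suits a quad-core A53.
interpreter = tf.lite.Interpreter(
    model_path="fall_detector_int8.tflite", num_threads=4)
interpreter.allocate_tensors()

input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]

# Placeholder frame; real code would feed a preprocessed sensor window.
frame = np.zeros(input_detail["shape"], dtype=input_detail["dtype"])
interpreter.set_tensor(input_detail["index"], frame)
interpreter.invoke()
scores = interpreter.get_tensor(output_detail["index"])
print("class scores:", scores)
```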


Sensor Fusion on Embedded Linux

Running a single ML model is one problem. Fusing data from multiple sensor modalities in real time, within a power budget, is a fundamentally harder one.

The Fusion Pipeline

Our system fuses three sensor modalities:

  1. mmWave Radar (60 GHz): 3D point clouds at 20 fps — the primary sensing modality for presence detection, fall detection, and vital signs extraction.
  2. IMU (6-axis): Accelerometer and gyroscope data at 100 Hz — used for device orientation compensation and as a secondary fall-detection signal.
  3. Environmental sensors: Temperature, humidity, barometric pressure at 1 Hz — used for context (e.g., bathroom vs. bedroom affects fall risk priors).

The fusion pipeline architecture:

Radar (SPI, 20fps)  ──┐
                       ├──▶ Feature     ──▶ Fusion   ──▶ Classification ──▶ Alert
IMU   (I2C, 100Hz) ──┤     Extraction      Engine       (Fall/No-Fall)     Engine
                       │
Env   (I2C, 1Hz)   ──┘
                            ↓
                       Preprocessing:
                       - Point cloud clustering (DBSCAN)
                       - IMU Kalman filtering
                       - Env sensor moving average
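To illustrate the preprocessing stage, here is a minimal sketch of the radar clustering step using scikit-learn's DBSCAN. The on-device implementation does not have to use scikit-learn, and the eps/min_samples values below are illustrative rather than our tuned parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_radar_frame(points, eps=0.3, min_samples=5):
    """Group one radar frame (N x 3 array of x, y, z in metres) into objects."""
    if len(points) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    clusters = []
    for label in set(labels):
        if label == -1:            # -1 marks noise points
            continue
        clusters.append(points[labels == label])
    return clusters

# Example: one synthetic frame with two blobs of points plus scattered noise
frame = np.vstack([
    np.random.normal([1.0, 2.0, 1.2], 0.05, size=(40, 3)),
    np.random.normal([3.0, 1.0, 0.4], 0.05, size=(30, 3)),
    np.random.uniform(0, 4, size=(10, 3)),
])
print([c.shape for c in cluster_radar_frame(frame)])
```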

Timing and Synchronisation

The biggest engineering challenge in multi-sensor fusion on embedded Linux is timing. Linux is not a real-time OS. Sensor data arrives asynchronously, and the scheduler can introduce jitter. Our approach:

  • Hardware timestamping: Each sensor driver captures a monotonic timestamp at the interrupt level, before any userspace scheduling delays.
  • Ring buffers: Each sensor writes to a lock-free ring buffer. The fusion thread reads from all buffers at the inference rate (20 Hz), selecting the most recent sample within a configurable time window.
  • PREEMPT_RT patch: We run the fusion thread on a PREEMPT_RT-patched kernel with SCHED_FIFO priority, reducing worst-case scheduling latency from ~10 ms to <500 µs.
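A simplified Python sketch of the ring-buffer read and freshness-window selection described above. The production buffers are lock-free and timestamps are captured at interrupt level; a mutex and userspace time.monotonic() stand in here.

```python
import threading
import time
from collections import deque

class SensorBuffer:
    """Fixed-size buffer of (monotonic_timestamp, sample) pairs."""
    def __init__(self, maxlen):
        self._buf = deque(maxlen=maxlen)
        self._lock = threading.Lock()     # stand-in for a lock-free buffer

    def push(self, sample):
        with self._lock:
            self._buf.append((time.monotonic(), sample))

    def latest_within(self, now, window_s):
        """Return the newest sample no older than window_s, else None."""
        with self._lock:
            for ts, sample in reversed(self._buf):
                if now - ts <= window_s:
                    return sample
        return None

radar_buf, imu_buf, env_buf = SensorBuffer(64), SensorBuffer(512), SensorBuffer(8)

def fusion_loop(rate_hz=20, window_s=0.1):
    period = 1.0 / rate_hz
    while True:
        now = time.monotonic()
        radar = radar_buf.latest_within(now, window_s)
        imu = imu_buf.latest_within(now, window_s)
        env = env_buf.latest_within(now, window_s * 20)   # env updates slowly
        if radar is not None and imu is not None:
            pass   # feature extraction + inference would run here
        time.sleep(max(0.0, period - (time.monotonic() - now)))
```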

Pipeline Latency Budget

| Stage | Budget | Measured (P99) |
|---|---|---|
| Sensor acquisition | 10 ms | 8 ms |
| Preprocessing (DBSCAN + Kalman) | 25 ms | 18 ms |
| Feature extraction | 15 ms | 12 ms |
| ML inference (INT8, XNNPACK) | 95 ms | 88 ms |
| Post-processing & alert logic | 5 ms | 3 ms |
| Total | 150 ms | 129 ms |

This comfortably meets our sub-200 ms end-to-end target with margin for worst-case conditions.

Power Management Strategies

Running ML inference 24/7 on a Cortex-A53 would consume around 2.5 W. For a battery-backed or thermally constrained device, that is unacceptable. Here is how we brought it down:

  1. CPU frequency scaling: Use the ondemand or schedutil cpufreq governor to drop clock speed during idle periods. During active inference, we pin to 1.2 GHz (instead of the maximum 1.8 GHz) — the performance difference is marginal for INT8 workloads, but the power saving is significant.
  2. Peripheral power gating: Disable unused peripherals (HDMI, USB, Ethernet PHY) at the regulator level. On our board, this alone saved 180 mW.
  3. Inference duty cycling: Instead of running inference on every radar frame (20 fps), we use a lightweight presence-detection algorithm on the DSP that triggers full ML inference only when a person is detected in the room. This reduces inference duty cycle to ~30% in typical home use.
  4. Memory bandwidth awareness: INT8 models are not just smaller — they consume less memory bandwidth. On the i.MX 8M Mini, memory access is one of the largest power consumers. Quantised models reduce DRAM traffic proportionally.
  5. Suspend-to-RAM (S2R): When no presence is detected for 60 seconds, the system enters S2R with only the radar front-end and a wake-on-motion interrupt active. Resume latency is ~120 ms.

Result: Average system power dropped from 2.5 W to 1.4 W during active monitoring and 35 mW in Suspend-to-RAM — enabling operation on a modest 3,000 mAh backup battery for over 80 hours in standby.
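As a rough sketch of items 1 and 3 from the list above — CPU frequency capping via sysfs and presence-gated duty cycling — with the caveat that cpufreq paths vary by board and kernel, and the presence_detected / run_inference callables are placeholders:

```python
from pathlib import Path
import time

# Illustrative sysfs location; requires root and differs between kernels.
POLICY = Path("/sys/devices/system/cpu/cpufreq/policy0")

def cap_cpu_frequency(max_khz=1_200_000, governor="schedutil"):
    """Pin the governor and cap the maximum clock at 1.2 GHz (values in kHz)."""
    (POLICY / "scaling_governor").write_text(governor + "\n")
    (POLICY / "scaling_max_freq").write_text(f"{max_khz}\n")

def monitoring_loop(presence_detected, run_inference, frame_period_s=0.05):
    """Duty-cycle the heavy CNN behind a cheap presence check."""
    while True:
        start = time.monotonic()
        if presence_detected():      # lightweight gate evaluated every frame
            run_inference()          # full inference only when someone is present
        time.sleep(max(0.0, frame_period_s - (time.monotonic() - start)))
```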

Edge vs Cloud: It Is Not Binary

The best architectures use edge and cloud together, with a clear division of responsibilities:

  • Edge handles: Real-time inference, privacy-sensitive processing, offline operation, latency-critical alerts.
  • Cloud handles: Model retraining on aggregated (anonymised) data, fleet-wide analytics, OTA model updates, long-term trend analysis.
  • The boundary: Only derived features and event summaries leave the device — never raw sensor data. This is both a privacy design choice and a bandwidth optimisation.

Our system uploads daily activity summaries and model confidence metrics to the cloud. The cloud pipeline uses these to detect model drift and trigger retraining. Updated models are deployed via OTA — but the device is never dependent on cloud availability for its core function.

Real-World Lessons from Field Deployment

After 18 months of development and field deployment, here is what surprised us most:

  • Thermal throttling is your real enemy. Lab benchmarks mean nothing if the SoC throttles in a sealed enclosure at 40°C ambient. We had to redesign our thermal management (adding a copper heat spreader and strategic ventilation) after field units showed 30% inference slowdowns during summer months.
  • Field accuracy differs from lab accuracy. Our fall-detection model achieved 98.2% accuracy on our test dataset. In real-world deployment, it initially dropped to 91% due to environmental factors we had not captured — furniture layouts, pets, multiple occupants. Continuous data collection and retraining cycles were essential.
  • OTA model updates are non-negotiable. The ability to push updated models to deployed devices saved us multiple times. Design your system for model updates from day one — version the model, validate on-device before switching, and always keep a fallback model.
  • Regulatory compliance adds constraints. Medical-adjacent devices face additional scrutiny on software changes. Our OTA update process required a full verification and validation cycle for each model update, adding weeks to the iteration cycle. Plan for this.
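To make the OTA point concrete, here is a minimal sketch of on-device validation before switching models: run a stored validation set through the candidate and promote it only if accuracy clears a threshold. The threshold and helper names are illustrative, and quantised inputs would additionally need scaling by the input tensor's quantisation parameters.

```python
import numpy as np
import tensorflow as tf

def validate_candidate(model_path, val_windows, val_labels, min_accuracy=0.95):
    """Check a downloaded model against known data before replacing the
    current one; on failure the device keeps its fallback model."""
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    correct = 0
    for window, label in zip(val_windows, val_labels):
        # Assumes each window already matches the model's input shape/dtype.
        interpreter.set_tensor(inp["index"],
                               window[np.newaxis].astype(inp["dtype"]))
        interpreter.invoke()
        pred = int(np.argmax(interpreter.get_tensor(out["index"])))
        correct += int(pred == label)

    accuracy = correct / len(val_labels)
    return accuracy >= min_accuracy, accuracy
```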

Conclusion

Edge AI on a power budget is not a single problem — it is a stack of interdependent engineering decisions, from silicon selection to model architecture to power management to deployment infrastructure. The key insight is that optimisation happens at every layer, and the gains compound. A quantised model on duty-cycled inference with frequency-scaled CPU on a well-chosen SoC can deliver 10–20x power reduction compared to a naive deployment — often the difference between a viable product and a prototype that never ships.

The tools and techniques described here are not theoretical. They are running in production, in homes, monitoring real people. That is the promise of edge AI — not flashy demos, but quiet, reliable, power-efficient intelligence at the point of need.
