Production-grade MLOps infrastructure running on a 20-node Raspberry Pi cluster with custom thermal management and distributed storage.
This project originated from a requirement to simulate edge computing constraints—network partitions, hardware failures, and strict resource limits—within a physical environment. While cloud infrastructure abstracts these complexities, building resilient distributed systems requires confronting them directly.
The solution is a high-density compute cluster comprising 20 Raspberry Pi Compute Module 4 nodes, orchestrated by Talos Linux. Each node is equipped with NVMe storage and powered via PoE+, eliminating common bottlenecks associated with single-board computers. The entire infrastructure, including networking and management, occupies just 4.33U of rack space while operating silently in an office environment.
The cluster serves as the production backbone for my personal infrastructure, hosting RAG pipelines, vector databases, and distributed caching layers. It demonstrates that enterprise-grade architecture principles—immutable infrastructure, GitOps, and observability—can be effectively scaled down to edge hardware.
Office environment • ~400W compute power • Silent operation
Talos Kubernetes orchestrating microservices
Choosing physical hardware over virtualization was deliberate. Distributed consensus algorithms (Raft, Paxos) behave differently under real network latency. Troubleshooting physical node failures provides operational experience that managed Kubernetes services cannot replicate.
The implementation uses a unified PoE+ architecture, delivering both power and 1Gbps networking over a single cable per node. This reduces failure points and simplifies thermal management within the rack enclosure.
To support database workloads, I bypassed the USB bus entirely. Each CM4 node utilizes its single PCIe lane for NVMe storage, enabling high-throughput distributed block storage across the cluster.
Talos Linux was selected for its API-driven, immutable nature. Eliminating SSH and shell access reduces the attack surface and enforces declarative configuration management, aligning with modern GitOps practices.
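As a rough illustration of what that declarative model looks like, below is a heavily trimmed sketch of a Talos machine config; the disk path, hostname, and endpoint are placeholders, and a real config also carries tokens and certificates omitted here.

```yaml
version: v1alpha1
machine:
  type: worker
  install:
    disk: /dev/nvme0n1        # NVMe drive on the CM4's PCIe lane (illustrative path)
    wipe: false
  network:
    hostname: cm4-node-01     # placeholder hostname
cluster:
  controlPlane:
    endpoint: https://10.0.0.10:6443   # placeholder control-plane endpoint
```

Configs like this are applied over the Talos API (e.g. with `talosctl apply-config`), so node state lives entirely in version-controlled YAML rather than in an interactive shell session.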
The ARM architecture required a shift in CI/CD pipelines. I implemented multi-architecture build steps and, in several instances, contributed ARM64 support back to upstream open-source projects to enable deployment on the cluster.
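The write-up doesn't name the build tooling; one common way to produce such multi-architecture images is Docker Buildx with QEMU emulation, sketched below with a placeholder registry and tag.

```bash
# Build and push a single image manifest covering both x86 CI runners and the ARM64 cluster.
# Assumes a buildx builder with QEMU binfmt support is already configured.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag registry.example.com/service:latest \
  --push .
```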
With 8GB RAM per node, memory is a scarce resource. This necessitated strict Quality of Service (QoS) classes, aggressive horizontal autoscaling, and optimized JVM/runtime configurations for hosted services.
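For context, Kubernetes assigns the Guaranteed QoS class when a container's resource requests equal its limits, which keeps memory usage predictable on the 8GB nodes. A minimal sketch of that pattern (names and values are illustrative, not the cluster's actual settings):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-service        # illustrative name
spec:
  containers:
    - name: app
      image: registry.example.com/service:latest
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "250m"          # requests == limits -> Guaranteed QoS class
          memory: "256Mi"      # hard cap so one pod cannot exhaust a node's 8GB
```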
Off-the-shelf PWM fan control depends on OS-level tooling that the immutable, shell-less OS cannot host. I developed a custom Rust daemon that interfaces directly with the I2C bus to manage fan curves from real-time thermal telemetry, maintaining optimal operating temperatures without OS-level dependencies.
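A condensed sketch of such a control loop, assuming the `i2cdev` crate and a fan controller that accepts a duty-cycle byte at a single register; the bus path, device address, register, polling interval, and fan curve below are assumptions for illustration, not the daemon's actual values.

```rust
use std::{fs, thread, time::Duration};

use i2cdev::core::I2CDevice;
use i2cdev::linux::LinuxI2CDevice;

// Assumed I2C address and register of the fan controller; the real values
// depend on the specific controller wired into the enclosure.
const FAN_CTRL_ADDR: u16 = 0x2f;
const FAN_DUTY_REG: u8 = 0x00;

/// Read the SoC temperature in degrees Celsius from sysfs (reported in millidegrees).
fn cpu_temp_c() -> std::io::Result<f32> {
    let raw = fs::read_to_string("/sys/class/thermal/thermal_zone0/temp")?;
    Ok(raw.trim().parse::<f32>().unwrap_or(0.0) / 1000.0)
}

/// Map temperature to a PWM duty cycle (0-255) along a simple linear curve.
fn fan_curve(temp_c: f32) -> u8 {
    match temp_c {
        t if t < 40.0 => 0,                       // passive below 40 °C
        t if t > 70.0 => 255,                     // full speed above 70 °C
        t => (((t - 40.0) / 30.0) * 255.0) as u8, // linear ramp in between
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut fan = LinuxI2CDevice::new("/dev/i2c-1", FAN_CTRL_ADDR)?;
    loop {
        let temp = cpu_temp_c()?;
        fan.smbus_write_byte_data(FAN_DUTY_REG, fan_curve(temp))?;
        thread::sleep(Duration::from_secs(5));
    }
}
```

Because the daemon talks to sysfs and the I2C character device directly, it can run as a small container with access to `/dev/i2c-1`, with no packages or shell required on the host.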
Managing stateful workloads on ephemeral nodes required a robust storage layer. I deployed a replicated block storage system that ensures data locality where possible while guaranteeing consistency across node failures.
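The write-up doesn't name the storage system, but a replicated block-storage layer is typically consumed through a StorageClass along these lines; the provisioner and replica parameter below are hypothetical placeholders that depend on the chosen system.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated-nvme                       # hypothetical name
provisioner: replicated-block.example.com     # placeholder for the storage system's provisioner
parameters:
  numberOfReplicas: "3"                       # hypothetical parameter: keep copies on multiple nodes
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer       # bind after scheduling so a replica can sit on the consuming node
```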