Distributed Edge Cluster

Production-grade MLOps infrastructure running on a 20-node Raspberry Pi cluster with custom thermal management and distributed storage.

Raspberry Pi CM4 • Kubernetes • Talos Linux • Rust • NVMe • PoE+ • MLOps • Distributed Systems

Architectural Overview

This project originated from a requirement to simulate edge computing constraints—network partitions, hardware failures, and strict resource limits—within a physical environment. While cloud infrastructure abstracts these complexities, building resilient distributed systems requires confronting them directly.

The solution is a high-density compute cluster comprising 20 Raspberry Pi Compute Module 4 nodes, orchestrated by Talos Linux. Each node is equipped with NVMe storage and powered via PoE+, eliminating common bottlenecks associated with single-board computers. The entire infrastructure, including networking and management, occupies just 4.33U of rack space while operating silently in an office environment.

The cluster serves as the production backbone for my personal infrastructure, hosting RAG pipelines, vector databases, and distributed caching layers. It demonstrates that enterprise-grade architecture principles—immutable infrastructure, GitOps, and observability—can be effectively scaled down to edge hardware.

System Architecture

Physical Rack Layout (4.33U Total)

Office environment • ~400W compute power • Silent operation

Network Layer - 2U

  • UniFi Dream Machine Pro (1U): Router • Gateway • Controller • IDS/IPS
  • USW Pro 24 PoE (1U): 24-port PoE+ switch • Powers all 20 Pi nodes

Management Layer - 1.33U

  • Racknex rack mount housing three Intel NUCs, each running Proxmox
  • Ubuntu VM jumpbox: kubectl • talosctl • GitOps • CI/CD

Compute Layer - 1U

  • Compute Blade carrier hosting 20 Raspberry Pi CM4 modules (ARM64 • 8GB RAM each)
  • Storage: 1TB NVMe per node
  • Power: PoE+ per node
  • Cooling: Noctua fans + custom Rust controller

Software Stack

Kubernetes on Talos Linux orchestrating microservices

MLOps Services

  • RAG Pipelines
  • Vector Databases (pgvector; query sketch below)
  • Embedding Generation
  • Model Serving
  • LLM Orchestration
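
The retrieval step behind those RAG pipelines reduces to a nearest-neighbour lookup against pgvector. The following is a minimal sketch of that query in Rust, assuming the pgvector crate's sqlx integration; the connection string, table, and column names are illustrative, not the cluster's actual schema.

```rust
use pgvector::Vector; // pgvector crate built with its "sqlx" feature
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connection string is illustrative; a real service would read it from config.
    let pool = PgPoolOptions::new()
        .max_connections(5)
        .connect("postgres://rag:secret@postgres.data.svc:5432/rag")
        .await?;

    // In the actual pipeline this vector would come from the embedding service.
    let query_embedding = Vector::from(vec![0.12_f32, -0.03, 0.87]);

    // `<->` is pgvector's L2 distance operator; LIMIT keeps only the closest chunks.
    let rows: Vec<(i64, String)> = sqlx::query_as(
        "SELECT id, content FROM documents ORDER BY embedding <-> $1 LIMIT 5",
    )
    .bind(query_embedding)
    .fetch_all(&pool)
    .await?;

    for (id, content) in rows {
        println!("chunk {id}: {content}");
    }
    Ok(())
}
```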

Data Layer

  • PostgreSQL (HA)
  • Redis (Caching/PubSub)
  • MinIO (S3-compatible)
  • Distributed Block Storage
  • 20TB Total NVMe

Observability

  • Prometheus (Metrics)
  • Grafana (Dashboards)
  • Distributed Tracing
  • AlertManager
  • Custom Thermal Monitor (exporter sketch below)
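
The custom thermal monitor is the same Rust daemon described under Technical Challenges. As a rough sketch of how such a daemon could expose per-node temperatures to Prometheus, the example below uses the prometheus crate and a bare TcpListener as a scrape endpoint; the metric name, label, and port are assumptions rather than the cluster's actual configuration.

```rust
use prometheus::{Encoder, GaugeVec, Opts, Registry, TextEncoder};
use std::io::{Read, Write};
use std::net::TcpListener;

/// Read a thermal zone in degrees Celsius via the standard Linux sysfs interface
/// (values are reported in millidegrees).
fn read_temp_c(zone: &str) -> std::io::Result<f64> {
    let raw = std::fs::read_to_string(format!("/sys/class/thermal/{zone}/temp"))?;
    Ok(raw.trim().parse::<f64>().unwrap_or(0.0) / 1000.0)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let registry = Registry::new();
    // Gauge labelled by thermal zone; metric and label names are illustrative.
    let temp_gauge = GaugeVec::new(
        Opts::new("node_thermal_celsius", "SoC temperature reported by sysfs"),
        &["zone"],
    )?;
    registry.register(Box::new(temp_gauge.clone()))?;

    // Minimal scrape endpoint; a production daemon would use a proper HTTP server.
    let listener = TcpListener::bind("0.0.0.0:9300")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut buf = [0u8; 512];
        let _ = stream.read(&mut buf); // discard the request line

        if let Ok(t) = read_temp_c("thermal_zone0") {
            temp_gauge.with_label_values(&["thermal_zone0"]).set(t);
        }

        let mut body = Vec::new();
        TextEncoder::new().encode(&registry.gather(), &mut body)?;
        let header = format!(
            "HTTP/1.1 200 OK\r\nContent-Type: text/plain; version=0.0.4\r\nContent-Length: {}\r\n\r\n",
            body.len()
        );
        stream.write_all(header.as_bytes())?;
        stream.write_all(&body)?;
    }
    Ok(())
}
```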

Network Architecture

  • UniFi Dream Machine Pro (routing/gateway)
  • USW Pro 24 PoE (24-port switch)
  • Single cable per Pi (PoE+ power + data)
  • Kubernetes CNI for pod networking
  • Service mesh & network policies

Power & Thermal

  • ~400W total power consumption
  • PoE+ budget management (~25W/port)
  • Custom Rust thermal controller
  • Noctua fan curves (0-100% PWM)
  • Silent operation in office environment

Engineering Decisions

01

Physicality & Edge Constraints

Choosing physical hardware over virtualization was deliberate. Distributed consensus algorithms (Raft, Paxos) behave differently under real network latency. Troubleshooting physical node failures provides operational experience that managed Kubernetes services cannot replicate.

02

Power & Networking Efficiency

The implementation uses a unified PoE+ architecture, delivering both power and 1Gbps networking over a single cable per node. This reduces failure points and simplifies thermal management within the rack enclosure.

03

Storage I/O Strategy

To support database workloads, I bypassed the USB bus entirely. Each CM4 node utilizes its single PCIe lane for NVMe storage, enabling high-throughput distributed block storage across the cluster.

04

Immutable OS Architecture

Talos Linux was selected for its API-driven, immutable nature. Eliminating SSH and shell access reduces the attack surface and enforces declarative configuration management, aligning with modern GitOps practices.

Technical Challenges

ARM64 Compatibility

The ARM architecture required a shift in CI/CD pipelines. I implemented multi-architecture build steps and, in several instances, contributed ARM64 support back to upstream open-source projects to enable deployment on the cluster.

Resource Constraints

With 8GB RAM per node, memory is a scarce resource. This necessitated strict Quality of Service (QoS) classes, aggressive horizontal autoscaling, and optimized JVM/runtime configurations for hosted services.
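
To make the QoS point concrete, the sketch below builds a container spec whose resource requests equal its limits, which is what Kubernetes requires for the Guaranteed QoS class. It assumes the k8s-openapi crate; the container name, image, and resource sizes are illustrative only.

```rust
use k8s_openapi::api::core::v1::{Container, ResourceRequirements};
use k8s_openapi::apimachinery::pkg::api::resource::Quantity;
use std::collections::BTreeMap;

/// Build a container spec with requests == limits so the pod lands in the
/// `Guaranteed` QoS class and is evicted last on memory-tight nodes.
fn guaranteed_container(name: &str, image: &str) -> Container {
    let resources: BTreeMap<String, Quantity> = BTreeMap::from([
        ("cpu".to_string(), Quantity("500m".to_string())),
        ("memory".to_string(), Quantity("512Mi".to_string())),
    ]);
    Container {
        name: name.to_string(),
        image: Some(image.to_string()),
        resources: Some(ResourceRequirements {
            requests: Some(resources.clone()),
            limits: Some(resources),
            ..Default::default()
        }),
        ..Default::default()
    }
}

fn main() {
    // Container name and image are hypothetical placeholders.
    let c = guaranteed_container("embedder", "registry.local/embedder:latest");
    println!("{c:#?}");
}
```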

Thermal Control Plane

Conventional fan-control tooling assumes a mutable host OS, which Talos deliberately does not provide. I developed a custom Rust daemon that interfaces directly with the I2C bus to manage fan curves based on real-time thermal telemetry, maintaining optimal operating temperatures without OS-level dependencies.
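
As a rough illustration of that daemon's core loop (not the production code), the sketch below reads the SoC temperature from sysfs, maps it through a linear fan curve, and writes the duty cycle over I2C via the i2cdev crate. The controller address, register, and curve endpoints are placeholders that depend on the actual fan-control hardware.

```rust
use i2cdev::core::I2CDevice;
use i2cdev::linux::LinuxI2CDevice;
use std::{fs, thread, time::Duration};

// Placeholder address/register; the real values depend on the fan controller IC.
const FAN_CTRL_ADDR: u16 = 0x2f;
const PWM_REGISTER: u8 = 0x30;

/// Read SoC temperature in degrees Celsius from the standard Linux sysfs interface.
fn soc_temp_c() -> f64 {
    fs::read_to_string("/sys/class/thermal/thermal_zone0/temp")
        .ok()
        .and_then(|s| s.trim().parse::<f64>().ok())
        .map(|millideg| millideg / 1000.0)
        .unwrap_or(80.0) // fail hot: assume worst case if the sensor read fails
}

/// Linear fan curve: silent below 45 °C, full speed at 70 °C and above.
fn fan_duty(temp_c: f64) -> u8 {
    let t = temp_c.clamp(45.0, 70.0);
    (((t - 45.0) / 25.0) * 255.0) as u8
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut fan = LinuxI2CDevice::new("/dev/i2c-1", FAN_CTRL_ADDR)?;
    loop {
        let duty = fan_duty(soc_temp_c());
        // Write the 8-bit duty cycle to the controller's PWM register.
        fan.smbus_write_byte_data(PWM_REGISTER, duty)?;
        thread::sleep(Duration::from_secs(5));
    }
}
```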

Distributed Storage

Managing stateful workloads on ephemeral nodes required a robust storage layer. I deployed a replicated block storage system that ensures data locality where possible while guaranteeing consistency across node failures.

Operational Metrics

  • 4.33U total rack space: 2U UniFi networking + 1.33U management (3 NUCs) + 1U Pi cluster (20 nodes)
  • 20TB distributed NVMe storage
  • ~400W compute power draw