client

Cerberus

Code Zero Data LTD

Description

Project Cerberus is a three-node on-premises server cluster engineered by Code Zero to make AI, ML and HPC workloads accessible beyond large enterprises. Each node is a Dell PowerEdge R750 (2U) with dual Intel Xeon Platinum 8352Y processors, 256 GB RAM, 20 TB of NVMe SSD storage and dual NVIDIA A40 GPUs. The total cluster cost is approximately £90k, compared with the £750k-£5m price tag of traditional platforms.

The cluster supports two software stacks. The first is a lightweight open-source stack (Ubuntu Linux, Docker, Prometheus, Grafana) that uses GPU passthrough. The second is a full enterprise stack: VMware vSphere with vGPU partitioning, VMware vSAN for software-defined storage, VMware Tanzu for Kubernetes management, VMware Horizon for VDI, VMware Aria for monitoring, and NVIDIA AI Enterprise Suite.

The three-node design enables a reverse centralised microservices approach in which each node has a dedicated role: one for GPU model training, one for inference, and one for Data Scientist and Data Engineer VDI sessions. Cerberus was designed natively for immersion cooling and represents a strategic collaboration with GRC, Castrol ON (part of bp Castrol), and Evoque-Cyxtera, one of the largest data centre colocation providers in the world.
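The per-node figures above add up to the cluster totals, and the quoted prices give a rough cost ratio. The following is a minimal Python sketch of that arithmetic, not code from the project. The names `Node`, `CLUSTER`, `cluster_totals` and `cost_ratio_range` are hypothetical, and the 48 GB of memory per A40 comes from NVIDIA's published specification rather than from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One Cerberus node, using the per-node figures from the description."""
    role: str
    ram_gb: int = 256
    nvme_tb: int = 20
    gpus: int = 2
    # Assumption: 48 GB per NVIDIA A40 (vendor spec, not stated in the description).
    gpu_mem_gb: int = 48


# One node per role, as described: training, inference, VDI sessions.
CLUSTER = [Node("training"), Node("inference"), Node("vdi")]


def cluster_totals(nodes):
    """Sum the per-node resources across the whole cluster."""
    return {
        "nodes": len(nodes),
        "ram_gb": sum(n.ram_gb for n in nodes),
        "nvme_tb": sum(n.nvme_tb for n in nodes),
        "gpus": sum(n.gpus for n in nodes),
        "gpu_mem_gb": sum(n.gpus * n.gpu_mem_gb for n in nodes),
    }


def cost_ratio_range(cluster_cost_gbp=90_000, low_gbp=750_000, high_gbp=5_000_000):
    """Return how many times more the quoted traditional platforms cost."""
    return low_gbp / cluster_cost_gbp, high_gbp / cluster_cost_gbp


if __name__ == "__main__":
    print(cluster_totals(CLUSTER))
    low, high = cost_ratio_range()
    print(f"Traditional platforms cost roughly {low:.1f}x to {high:.1f}x more")
```

Under these assumptions the cluster totals 768 GB RAM, 60 TB NVMe and six GPUs, and the quoted traditional platforms cost roughly 8 to 56 times as much.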

Responsibilities

  • Infrastructure design and implementation.
  • Experiment run preparation and execution.
  • Metrics gathering, analysis and visualization.

Role

Lead

Technologies

Python
TensorFlow
Keras
PyTorch
AWS
Docker
Prometheus
Grafana
