Dorado's Cruz AI Fabric Controller is an orchestration and management tool for AI workloads, particularly those that run over RoCEv2 (RDMA over Converged Ethernet v2). It tunes Ethernet networks to deliver the lossless connectivity that RDMA-based AI applications require.
Let’s explore how it achieves this.
Key Capabilities for Managing RoCEv2 Networks
1. AI Fabric Orchestration
Cruz AI Fabric Controller supports a variety of fabric topologies, including Clos, rail, and custom configurations. These fit naturally with RoCEv2, which is routable at Layer 3 because it runs over UDP/IP. The controller automates the design, deployment, and operation of both backend and frontend fabrics, which simplifies the management of RoCEv2-enabled networks.
2. Congestion Control
Network congestion can be a major bottleneck for AI workloads. Cruz implements adaptive congestion control mechanisms to keep connectivity high-speed and lossless. RoCEv2 performance degrades sharply under packet loss, so addressing congestion dynamically keeps GPUs from stalling on the network during AI training and inference, which in turn raises GPU utilization.
3. Multi-Tenancy
In multi-tenant environments, Cruz AI Fabric Controller introduces a "Class of Tenant" feature that prioritizes jobs according to the enterprise's business priorities. It partitions infrastructure assets, such as GPUs, across multiple users or tenants. Resources are allocated and deallocated dynamically in real time, which keeps RoCEv2 efficient for multi-tenant AI workloads.
4. Integration with SONiC
The controller leverages Software for Open Networking in the Cloud (SONiC) to provide robust support for RoCEv2 deployments. Features like DCQCN (Data Center Quantized Congestion Notification) ensure effective congestion management, enabling reliable and efficient Ethernet-based AI fabric operations.
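To make the DCQCN idea concrete, here is a minimal sketch of the sender-side rate reaction as described in the published DCQCN algorithm. It is an illustration only, not Cruz's or SONiC's implementation: in real deployments this logic runs in the NIC, while the switch side marks packets with ECN. The class name, method names, and the choice of Gbps units are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class DcqcnSender:
    """Simplified DCQCN sender state for one flow (rates in Gbps).

    Hypothetical illustration; names and units are assumptions.
    """
    line_rate: float
    g: float = 1 / 256         # gain for the congestion estimate (illustrative)
    alpha: float = 1.0         # congestion estimate, starts at 1
    current_rate: float = 0.0  # set to line_rate in __post_init__
    target_rate: float = 0.0   # set to line_rate in __post_init__

    def __post_init__(self):
        # A new flow starts at full line rate.
        self.current_rate = self.line_rate
        self.target_rate = self.line_rate

    def on_cnp(self):
        """A Congestion Notification Packet arrived: cut the sending rate."""
        self.target_rate = self.current_rate          # remember the rate to recover to
        self.current_rate *= (1 - self.alpha / 2)     # multiplicative decrease
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self):
        """No CNP during the update window: decay the congestion estimate."""
        self.alpha = (1 - self.g) * self.alpha

    def on_recovery_step(self):
        """Fast recovery: move halfway back toward the target rate."""
        self.current_rate = (self.current_rate + self.target_rate) / 2


sender = DcqcnSender(line_rate=100.0)
sender.on_cnp()            # congestion signalled: rate is cut
sender.on_recovery_step()  # congestion eased: rate climbs back toward the target
print(sender.current_rate)
```

The sketch omits the additive-increase and hyper-increase phases, timers, and byte counters, so it shows only the shape of the control loop: cut quickly when congestion is signalled, then recover in steps.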
5. AI Workload Management
By integrating with popular job schedulers, Cruz AI Fabric Controller optimizes RoCEv2 bandwidth based on AI job requirements, tenant class, available bandwidth, and congestion levels. It automates complex network tuning, reducing manual intervention and minimizing errors.
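One way to picture bandwidth allocation driven by job requirements, tenant class, and available bandwidth is a weighted max-min fair share. The sketch below is a generic textbook technique, not Cruz's actual algorithm, which the source does not describe. The class names ("gold", "silver"), their weights, the job names, and the function name are all hypothetical.

```python
# Hypothetical tenant classes and weights; not taken from the Cruz product.
CLASS_WEIGHTS = {"gold": 3, "silver": 1}


def allocate_bandwidth(demands, weights, capacity):
    """Weighted max-min fair allocation.

    demands:  {job: requested bandwidth in Gbps}
    weights:  {job: positive weight, e.g. derived from tenant class}
    capacity: total available bandwidth in Gbps
    """
    alloc = {job: 0.0 for job in demands}
    active = {job for job, d in demands.items() if d > 0}
    remaining = float(capacity)

    while active and remaining > 1e-9:
        total_weight = sum(weights[j] for j in active)
        share = {j: remaining * weights[j] / total_weight for j in active}

        # Jobs whose weighted share covers their demand are capped at demand.
        satisfied = {j for j in active if share[j] >= demands[j]}
        if not satisfied:
            # Nobody can be fully served: split what is left by weight.
            for j in active:
                alloc[j] = share[j]
            break

        for j in satisfied:
            alloc[j] = float(demands[j])
            remaining -= demands[j]
        active -= satisfied  # leftover capacity is redistributed next round

    return alloc


jobs = {"train-a": 40, "train-b": 300, "infer-c": 300}   # requested Gbps
job_class = {"train-a": "silver", "train-b": "gold", "infer-c": "silver"}
weights = {job: CLASS_WEIGHTS[cls] for job, cls in job_class.items()}

print(allocate_bandwidth(jobs, weights, capacity=400))
```

In this example the small job receives its full request, and the capacity it leaves unused is split between the other two jobs in proportion to their class weights. A real controller would also feed in live congestion signals, which this sketch leaves out.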
6. Resource Optimization
Advanced algorithms intelligently allocate resources, ensuring RoCEv2 networks operate at peak efficiency. This maximizes GPU and network asset utilization, aligning with enterprise AI strategies and delivering optimal performance.
Real-World Use Cases
Cruz AI Fabric Controller’s capabilities shine in a variety of AI scenarios:
- AI Training and Inference: It manages inter-GPU networking demands, ensuring high-performance AI training and inference workloads run smoothly.
- Distributed Edge AI: Cruz supports RoCEv2 in distributed edge AI inference scenarios, enabling seamless operations across subnets.
- Enterprise AI Data Centers: Whether for centralized or distributed AI architectures, the controller optimizes RoCEv2 networks to meet the demands of enterprise-scale AI.
Why It Matters
As AI workloads grow in complexity and scale, the need for efficient, scalable, and adaptable network management becomes critical. Dorado's Cruz AI Fabric Controller addresses these challenges head-on, ensuring that RoCEv2 networks deliver the performance and reliability required for modern AI applications.
By integrating advanced orchestration, congestion control, multi-tenancy, and resource optimization features, Cruz AI Fabric Controller empowers enterprises to unlock the full potential of their AI infrastructure. Whether you’re managing massive AI training clusters or deploying edge AI solutions, this tool ensures your RoCEv2 networks are ready to meet the demands of tomorrow’s AI workloads.
Ready to learn more?
Discover how Cruz AI Fabric Controller can transform your RoCEv2 networks for peak performance and scalability.