Dorado's Cruz AI Fabric Controller is an orchestration and management tool for AI workloads, particularly those that use RoCEv2 (RDMA over Converged Ethernet v2). It tunes Ethernet networks to deliver the lossless connectivity that RDMA-based AI applications require.
Let’s explore how it achieves this.
Cruz AI Fabric Controller supports a variety of fabric topologies, including Clos, rail, and custom configurations, and works with RoCEv2's Layer 3 routing. It automates the design, deployment, and operation of both backend and frontend fabrics, which simplifies the management of RoCEv2-enabled networks.
Network congestion can be a major bottleneck for AI workloads. Cruz implements adaptive congestion control mechanisms to mitigate it and maintain high-speed, lossless connectivity. Because GPUs sit idle while they wait on delayed or retransmitted data, addressing congestion dynamically improves GPU utilization, which is essential for AI training and inference over RoCEv2.
In multi-tenant environments, Cruz AI Fabric Controller introduces a "Class of Tenant" feature that prioritizes jobs according to enterprise business practices. It partitions infrastructure assets, such as GPUs, across multiple users or tenants. Resources are allocated and deallocated dynamically in real time, which keeps RoCEv2 fabrics efficient for multi-tenant AI workloads.
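To make the idea of class-based partitioning concrete, here is a minimal sketch of priority-ordered GPU allocation. It is not Cruz's implementation: the `GpuRequest` type, the `tenant_class` field (lower number means higher priority), and the `allocate_gpus` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GpuRequest:
    tenant: str
    tenant_class: int  # hypothetical convention: lower number = higher priority
    gpus: int


def allocate_gpus(total_gpus: int, requests: list[GpuRequest]) -> dict[str, int]:
    """Grant GPUs in tenant-class order until the shared pool is exhausted."""
    remaining = total_gpus
    grants: dict[str, int] = {}
    # sorted() is stable, so requests within the same class keep arrival order.
    for req in sorted(requests, key=lambda r: r.tenant_class):
        granted = min(req.gpus, remaining)
        grants[req.tenant] = grants.get(req.tenant, 0) + granted
        remaining -= granted
    return grants


requests = [
    GpuRequest("research", tenant_class=2, gpus=6),
    GpuRequest("production", tenant_class=1, gpus=8),
    GpuRequest("sandbox", tenant_class=3, gpus=4),
]
print(allocate_gpus(12, requests))
# {'production': 8, 'research': 4, 'sandbox': 0}
```

A real controller would also need to reclaim GPUs when jobs finish and decide whether higher classes may preempt running work; this sketch only covers a single allocation pass.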
The controller leverages Software for Open Networking in the Cloud (SONiC) to provide robust support for RoCEv2 deployments. Features like DCQCN (Data Center Quantized Congestion Notification) ensure effective congestion management, enabling reliable and efficient Ethernet-based AI fabric operations.
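For readers unfamiliar with DCQCN, the sketch below shows the sender-side rate rules described in the published algorithm: a multiplicative rate cut when a Congestion Notification Packet (CNP) arrives, a decaying congestion estimate `alpha`, and a fast-recovery step back toward the pre-cut rate. It is a deliberately simplified model. In practice this logic runs in NIC hardware while switches mark packets with ECN, and the additive-increase and hyper-increase stages are omitted here. The class name, method names, and the gain value `g` are illustrative assumptions, not Cruz or SONiC APIs.

```python
from dataclasses import dataclass, field


@dataclass
class DcqcnSender:
    line_rate_gbps: float
    g: float = 1 / 256  # alpha gain; illustrative value
    alpha: float = 1.0  # running estimate of congestion severity
    current_rate: float = field(init=False)
    target_rate: float = field(init=False)

    def __post_init__(self) -> None:
        # A new flow starts at line rate.
        self.current_rate = self.line_rate_gbps
        self.target_rate = self.line_rate_gbps

    def on_cnp(self) -> None:
        """A CNP arrived: remember the old rate, cut the rate, raise alpha."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_alpha_timer(self) -> None:
        """No CNP during the timer window: let alpha decay."""
        self.alpha *= 1 - self.g

    def on_fast_recovery(self) -> None:
        """Move halfway back toward the rate in use before the last cut."""
        self.current_rate = (self.target_rate + self.current_rate) / 2


sender = DcqcnSender(line_rate_gbps=100.0)
sender.on_cnp()
print(sender.current_rate)  # 50.0
sender.on_fast_recovery()
print(sender.current_rate)  # 75.0
```

The first cut is severe because `alpha` starts at 1; as CNP-free timer windows pass, `alpha` decays and later cuts become gentler.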
By integrating with popular job schedulers, Cruz AI Fabric Controller optimizes RoCEv2 bandwidth based on AI job requirements, tenant class, available bandwidth, and congestion levels. It automates complex network tuning, reducing manual intervention and minimizing errors.
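One common way to divide bandwidth using inputs like these is weighted max-min fairness: each job receives a share proportional to its weight, and any share a job cannot use is redistributed among the others. The sketch below assumes that approach purely for illustration; the source does not say which algorithm Cruz uses. The function name `weighted_share`, the job names, and the idea of deriving `weights` from tenant class are hypothetical, and congestion could be modeled by lowering `capacity`.

```python
def weighted_share(
    capacity: float, demands: dict[str, float], weights: dict[str, float]
) -> dict[str, float]:
    """Weighted max-min fair split of link capacity (Gbps) among jobs."""
    alloc = {job: 0.0 for job in demands}
    active = {job for job, demand in demands.items() if demand > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_weight = sum(weights[j] for j in active)
        share = {j: remaining * weights[j] / total_weight for j in active}
        # Jobs that need no more than their weighted share are fully satisfied.
        satisfied = {j for j in active if demands[j] <= share[j]}
        if not satisfied:
            # Every remaining job is constrained: hand out the shares and stop.
            for j in active:
                alloc[j] = share[j]
            break
        for j in satisfied:
            alloc[j] = demands[j]
            remaining -= demands[j]
        active -= satisfied  # leftover capacity is re-split among the rest
    return alloc


demands = {"train": 300.0, "infer": 50.0, "batch": 200.0}
weights = {"train": 3.0, "infer": 1.0, "batch": 1.0}  # e.g. from tenant class
print(weighted_share(400.0, demands, weights))
```

In this example the inference job needs less than its share, so it is fully served and the unused capacity is split 3:1 between the training and batch jobs.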
Advanced algorithms intelligently allocate resources, ensuring RoCEv2 networks operate at peak efficiency. This maximizes GPU and network asset utilization, aligning with enterprise AI strategies and delivering optimal performance.
Cruz AI Fabric Controller’s capabilities shine in a variety of AI scenarios:
As AI workloads grow in complexity and scale, the need for efficient, scalable, and adaptable network management becomes critical. Dorado's Cruz AI Fabric Controller addresses these challenges head-on, ensuring that RoCEv2 networks deliver the performance and reliability required for modern AI applications.
By integrating advanced orchestration, congestion control, multi-tenancy, and resource optimization features, Cruz AI Fabric Controller empowers enterprises to unlock the full potential of their AI infrastructure. Whether you’re managing massive AI training clusters or deploying edge AI solutions, this tool ensures your RoCEv2 networks are ready to meet the demands of tomorrow’s AI workloads.
Ready to learn more?
Discover how Cruz AI Fabric Controller can transform your RoCEv2 networks for peak performance and scalability.