NVIDIA Announces Open Source Run:ai Scheduler to Foster Community Collaboration on Enterprise AI

NVIDIA announced the open source release of KAI Scheduler, a Kubernetes-native GPU scheduling solution, now available under the Apache 2.0 license. The release underscores NVIDIA's commitment to advancing both open source and enterprise AI infrastructure, and is intended to foster an active, collaborative community around the project, encouraging contributions, feedback, and innovation.

Originally developed within the Run:ai platform, KAI Scheduler is now available to the community while also continuing to be packaged and delivered as part of the NVIDIA Run:ai platform.   

Managing AI workloads on GPUs and CPUs presents a number of challenges that traditional resource schedulers often fail to address. KAI Scheduler was developed specifically to tackle these issues:

  • Managing fluctuating GPU demands
  • Reducing wait times for compute access
  • Guaranteeing resource allocations across teams
  • Seamlessly connecting AI tools and frameworks

The KAI Scheduler continuously recalculates fair-share values and adjusts quotas and limits in real time, automatically matching the current workload demands. This dynamic approach helps ensure efficient GPU allocation without constant manual intervention from administrators.
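To make the idea concrete, here is a minimal Python sketch of weighted fair-share recalculation. It is illustrative only: the queue shapes, weights, and function names are our own, not KAI Scheduler's actual code or API.

```python
# Illustrative sketch of weighted fair-share recalculation; this is not
# KAI Scheduler's implementation. Each queue gets a share of the cluster's
# GPUs proportional to its weight, recomputed as demand changes.

def fair_shares(total_gpus, queues):
    """queues: {name: {"weight": float, "demand": int}} (hypothetical shape)."""
    # Only queues with outstanding demand participate in the division.
    active = {n: q for n, q in queues.items() if q["demand"] > 0}
    total_weight = sum(q["weight"] for q in active.values()) or 1.0
    shares = {}
    for name, q in active.items():
        # A queue never receives more than it is asking for; the surplus
        # stays available for other queues at the next recalculation.
        proportional = total_gpus * q["weight"] / total_weight
        shares[name] = min(q["demand"], int(proportional))
    return shares

# Example: demand shifts between two teams during the day.
print(fair_shares(16, {
    "team-a": {"weight": 2.0, "demand": 12},
    "team-b": {"weight": 1.0, "demand": 10},
}))  # -> {'team-a': 10, 'team-b': 5}
```

Because the shares are recomputed on every cycle rather than fixed once, a queue whose demand drops to zero simply stops counting, and its capacity flows to the queues that still have work pending.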

For ML engineers, time is of the essence, NVIDIA said. The scheduler reduces wait times by combining gang scheduling, GPU sharing, and a hierarchical queuing system: engineers can submit batches of jobs and step away, confident that tasks will launch as soon as resources are available, in line with priorities and fairness.
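Gang scheduling is the all-or-nothing piece of that picture: a distributed job's pods launch together or not at all, so a job never holds partial resources while waiting indefinitely. The Python sketch below illustrates the concept under simplified assumptions; the data shapes are hypothetical and do not come from KAI Scheduler.

```python
# Illustrative gang-scheduling check, not KAI Scheduler's actual logic.
# A "gang" (e.g., all workers of a distributed training job) is placed
# only if every member fits; otherwise nothing is placed.

def try_place_gang(free_gpus_per_node, pods):
    """pods: list of per-pod GPU counts (hypothetical shape)."""
    plan, free = [], dict(free_gpus_per_node)
    for need in pods:
        node = next((n for n, f in free.items() if f >= need), None)
        if node is None:
            return None  # one pod doesn't fit -> place none of them
        free[node] -= need
        plan.append((node, need))
    return plan  # all-or-nothing placement

print(try_place_gang({"node-1": 4, "node-2": 2}, [2, 2, 2]))
# -> [('node-1', 2), ('node-1', 2), ('node-2', 2)]
print(try_place_gang({"node-1": 4, "node-2": 2}, [2, 2, 2, 2]))
# -> None (the fourth pod doesn't fit, so the whole gang waits)
```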

In shared clusters, some researchers secure more GPUs than necessary early in the day to ensure availability throughout. This practice can lead to underutilized resources, even when other teams still have unused quotas.

KAI Scheduler addresses this by enforcing resource guarantees. It ensures that AI practitioner teams receive their allocated GPUs, while also dynamically reallocating idle resources to other workloads. This approach prevents resource hogging and promotes overall cluster efficiency.
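Here is a minimal sketch of how such an admission decision could work, assuming per-queue quotas and a preemptible label for borrowed capacity. Both are simplifications of ours; this is not KAI Scheduler's actual logic.

```python
# Illustrative sketch of guaranteed quota with opportunistic borrowing.

def admit(queue, request, quotas, usage, total_free):
    """quotas/usage: {queue: gpus}; names and shapes are hypothetical."""
    guaranteed_left = quotas[queue] - usage[queue]
    if request <= guaranteed_left:
        return "run"              # within the team's guaranteed quota
    if request <= total_free:
        return "run-preemptible"  # borrow idle GPUs; may be reclaimed
    return "queue"                # wait until resources free up

# Team B's quota sits idle, so Team A can borrow beyond its guarantee --
# but the borrowed work is marked preemptible and is reclaimed the moment
# Team B submits jobs that need its guaranteed GPUs.
print(admit("team-a", 6, {"team-a": 4, "team-b": 4},
            {"team-a": 4, "team-b": 0}, total_free=8))  # -> run-preemptible
```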

Connecting AI workloads with various AI frameworks can be daunting. Traditionally, teams face a maze of manual configurations to tie together workloads with tools such as Kubeflow, Ray, Argo, and the Training Operator. This complexity delays prototyping, according to the vendor.

KAI Scheduler addresses this by featuring a built-in podgrouper that automatically detects and connects with these tools and frameworks—reducing configuration complexity and accelerating development.
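Conceptually, grouping like this can be done by walking each pod's owner references up to the top-level workload object (a Ray cluster, a training job, and so on) and bundling pods that share it, so the whole workload is scheduled as one unit. Here is a toy Python sketch of that idea, with hypothetical data shapes rather than KAI Scheduler's actual types.

```python
# Illustrative sketch of pod grouping by top-level owner; not the
# project's real podgrouper.

from collections import defaultdict

def group_pods(pods, owners):
    """pods: [(pod_name, owner)]; owners: {child: parent} (hypothetical)."""
    def top_owner(obj):
        while obj in owners:  # follow owner references upward
            obj = owners[obj]
        return obj
    groups = defaultdict(list)
    for pod, owner in pods:
        groups[top_owner(owner)].append(pod)
    return dict(groups)

print(group_pods(
    [("ray-head-0", "RayCluster/demo"),
     ("ray-worker-0", "RayCluster/demo"),
     ("trainer-0", "PyTorchJob/bert")],
    owners={},  # both owners are already top-level in this example
))
# -> {'RayCluster/demo': ['ray-head-0', 'ray-worker-0'],
#     'PyTorchJob/bert': ['trainer-0']}
```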

The scheduling process works as follows: a workload is assigned to a queue and generates pods, which are grouped into a podgroup. The podgroup is sent to the scheduler, which considers workloads from multiple queues. The scheduler operates in a continuous loop, performing a series of steps: taking a cluster snapshot (GPUs and CPUs), dividing resources, executing scheduling actions (allocation, consolidation, reclamation, and preemption), and updating the cluster status. These steps run in a defined order, ensuring efficient resource allocation and management.
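That loop can be summarized in a short skeleton. The step names below follow the description above, but the functions are stubs of our own devising, not KAI Scheduler's actual API.

```python
# Illustrative skeleton of the scheduling loop; every function here is a
# locally defined stub, not part of KAI Scheduler.

def take_snapshot(cluster):   return dict(cluster)  # freeze GPU/CPU state
def divide_resources(snap):   return {}             # recompute fair shares
def allocate(snap, shares):   pass                  # place pending podgroups
def consolidate(snap):        pass                  # defragment GPU usage
def reclaim(snap, shares):    pass                  # take back borrowed GPUs
def preempt(snap):            pass                  # evict lower-priority work
def update_status(cluster, snap): cluster.update(snap)

def scheduling_cycle(cluster):
    """One iteration; in practice the scheduler repeats this indefinitely."""
    snap = take_snapshot(cluster)    # 1. snapshot GPUs and CPUs
    shares = divide_resources(snap)  # 2. divide resources among queues
    allocate(snap, shares)           # 3a. allocation
    consolidate(snap)                # 3b. consolidation
    reclaim(snap, shares)            # 3c. reclamation
    preempt(snap)                    # 3d. preemption
    update_status(cluster, snap)     # 4. write results back
```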

KAI Scheduler isn’t just a prototype. It’s the robust engine at the heart of the NVIDIA Run:ai platform, trusted by many enterprises and powering critical AI operations. With its proven track record, the KAI Scheduler sets the gold standard for AI workload orchestration, according to NVIDIA.

For more information about this news, visit https://developer.nvidia.com.
