Replies: 2 comments
- cc @mimowo Anyone you can loop in on this?
- I can't answer 1-3. But for 4:
Hi Team,
Firstly, thank you for maintaining and developing such a great product.
We started to use MultiKueue and want to expand across multiple clouds for GPU resource availability.
Setup
We have a MultiKueue setup with:
Deployments are submitted on the manager and distributed to workers via MultiKueue.
Problem
Pods created on GKE manager contain GKE-specific fields that don't work on non-GKE workers:
What We Tried
We deployed a MutatingWebhook on worker clusters to:
Result
| Mutation Type | Behavior |
| --- | --- |
| schedulerName patch | ✅ Works - pods progress normally |
| readinessGates patch | ✅ Works - pods progress normally |
| WIF injection (volumes, env, volumeMounts) | ❌ Pods terminate in 1-2 seconds |
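For concreteness, the three mutation types can be sketched as the JSON patches a mutating webhook would return in its AdmissionReview response. The field paths follow the Pod API; the volume name, mount path, and env values are illustrative placeholders, not our exact configuration:

```python
import json

# Sketch of the per-mutation JSON patches (RFC 6902) our worker-side
# MutatingWebhook applies. Values below are illustrative placeholders.
scheduler_patch = [
    {"op": "replace", "path": "/spec/schedulerName", "value": "default-scheduler"},
]

readiness_patch = [
    {"op": "add", "path": "/spec/readinessGates/-",
     "value": {"conditionType": "example.com/custom-gate"}},
]

# WIF injection: the combination that triggers the immediate termination.
wif_patch = [
    {"op": "add", "path": "/spec/volumes/-",
     "value": {"name": "wif-token", "projected": {"sources": []}}},
    {"op": "add", "path": "/spec/containers/0/volumeMounts/-",
     "value": {"name": "wif-token", "mountPath": "/var/run/secrets/wif"}},
    {"op": "add", "path": "/spec/containers/0/env/-",
     "value": {"name": "GOOGLE_APPLICATION_CREDENTIALS",
               "value": "/var/run/secrets/wif/credentials.json"}},
]

print(json.dumps(wif_patch, indent=2))
```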
Error Details
When WIF injection is enabled:
- Workload transitions: Admitted → Finished in ~34ms
- Pod terminates immediately after creation
- Kueue controller logs show: "prebuilt workload not found"
Root Cause Analysis
We believe Kueue creates a spec hash when the Workload is created on the manager. When our webhook on the worker modifies the Pod spec (adding volumes/env/volumeMounts), the spec no longer matches the prebuilt workload hash, causing Kueue to reject it.
Interestingly, the schedulerName and readinessGates patches don't cause this issue, suggesting that Kueue may exclude certain fields from the hash comparison.
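Our mental model of the rejection looks roughly like this. This is only a sketch of the behavior we observe, not Kueue's actual implementation, and the excluded-field list is our guess from the table above:

```python
import copy
import hashlib
import json

# Fields we suspect are excluded from the comparison, based on observation.
EXCLUDED_FIELDS = ("schedulerName", "readinessGates")

def spec_hash(pod_spec: dict) -> str:
    """Hash a pod spec after dropping fields excluded from comparison."""
    normalized = copy.deepcopy(pod_spec)
    for field in EXCLUDED_FIELDS:
        normalized.pop(field, None)
    return hashlib.sha256(
        json.dumps(normalized, sort_keys=True).encode()
    ).hexdigest()

original = {"containers": [{"name": "main", "image": "app:v1"}]}

# Excluded-field mutation: hash unchanged, prebuilt workload still matches.
patched = {**original, "schedulerName": "custom-scheduler"}
assert spec_hash(patched) == spec_hash(original)

# Volume injection: hash changes, so the prebuilt workload no longer matches
# and the pod is rejected ("prebuilt workload not found").
mutated = copy.deepcopy(original)
mutated["volumes"] = [{"name": "wif-token"}]
assert spec_hash(mutated) != spec_hash(original)
```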
Questions
Context:
- Our workers have different capabilities (GKE-native Workload Identity vs. WIF for non-GKE)
- We need different volume mounts depending on the target cluster

Workarounds we are considering:
- Manager-side webhook: inject the cluster-specific spec before Kueue processes the Job (based on the target queue label)
- Separate Jobs: maintain a different Job spec per cluster type
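The manager-side webhook workaround could be sketched like this. The label key is Kueue's standard queue-name label; the queue names and patch contents are hypothetical:

```python
# Sketch: on the manager, pick the cluster-specific injection from the Job's
# queue label before Kueue computes the workload spec. Queue names and patch
# bodies below are placeholders for illustration.
INJECTIONS = {
    "gke-queue": [],  # GKE-native Workload Identity: nothing to add
    "non-gke-queue": [
        {"op": "add", "path": "/spec/template/spec/volumes/-",
         "value": {"name": "wif-token"}},
    ],
}

def patches_for(job: dict) -> list:
    """Return the JSON patches to apply for the Job's target queue."""
    queue = job.get("metadata", {}).get("labels", {}).get(
        "kueue.x-k8s.io/queue-name", "")
    return INJECTIONS.get(queue, [])
```

Since the injection happens before the Workload (and its spec hash) is created, the worker-side pod should match the prebuilt workload without further mutation.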
Is there a better/native way to handle this?
Thanks again!