DRA: CPU Placement Bit String #5213
Comments
/assign @catblade
@johnbelamaric: GitHub didn't allow me to assign the following users: catblade. Note that only kubernetes members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/cc
I'm very interested in this direction, so I have some initial musings and questions. I expect there will be more clarity and detail at the maintainer summit session.
Thanks!
@wickberg or @catblade will have a more definitive answer, but my understanding is it is a literal bitstring. But it's up to us really.
Both.
We won't take that away; it's part of the API. But its uses may diminish. It still has uses for other things, but it may no longer be needed for CPU alignment. We might be able to achieve this without any API changes at first, at least for "best effort".
Enhancement Description
One of the key goals of DRA is to help with alignment of devices in the intra-node topology. In the current DRA incarnation, this is done via the matchAttributes constraint. This allows you to require that specified attributes for all devices satisfying the request have the same value. For example, the NIC and GPU that are selected must share the same PCIe root complex.
One issue with this is that it does not include CPU alignment. Another issue is that intra-node topologies vary widely and are often quite complex; simple attribute matching may not be sufficient in many cases. It also requires a fair bit of knowledge on the part of users as to how to align these different devices.
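To make the matching rule concrete, here is a minimal Python sketch of the semantics described above: a set of selected devices satisfies the constraint only if every device publishes the attribute and all of the values are equal. The function name, the dictionary representation of a device, and the attribute name example.com/pcieRoot are illustrative assumptions, not the actual DRA implementation or API.

```python
from typing import Mapping, Sequence


def satisfies_match_attribute(
    devices: Sequence[Mapping[str, str]], attribute: str
) -> bool:
    """Return True if every device has `attribute` and all values agree.

    Each device is modeled as a plain dict of attribute name to value
    (an assumption made for illustration only).
    """
    values = set()
    for device in devices:
        if attribute not in device:
            # A device that does not publish the attribute cannot match.
            return False
        values.add(device[attribute])
    # Zero or one distinct value means the devices are consistent.
    return len(values) <= 1


# Hypothetical attribute name and values, for illustration only.
nic = {"example.com/pcieRoot": "pci0000:00"}
gpu_near = {"example.com/pcieRoot": "pci0000:00"}
gpu_far = {"example.com/pcieRoot": "pci0000:80"}

print(satisfies_match_attribute([nic, gpu_near], "example.com/pcieRoot"))
print(satisfies_match_attribute([nic, gpu_far], "example.com/pcieRoot"))
```

Note that nothing in this check says anything about which CPUs are close to the selected devices, which is the gap described above.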
The approach taken by Slurm, an HPC scheduler, is a bit different. Rather than requiring an understanding of specific attributes, Slurm calculates a standardized CPU-placement bit string for every device. That is, it normalizes based on the number of CPUs in the node, and publishes a bit string for each device that represents which CPU(s) that device is aligned with. This localizes the alignment logic to the node, rather than requiring users to understand it in depth.
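As a rough illustration of what such a bit string could look like, the Python sketch below turns a Linux-style CPU list (the comma-and-range form found in sysfs files such as a PCI device's local_cpulist) into a fixed-width string with one character per CPU in the node. The exact encoding Slurm uses is not specified here; the function names, the choice that the leftmost character is CPU 0, and the example values are all assumptions.

```python
from typing import Set


def parse_cpulist(cpulist: str) -> Set[int]:
    """Parse a CPU list such as "0-3,8-11" into a set of CPU ids.

    Only plain ids and inclusive ranges are handled in this sketch.
    """
    cpus: Set[int] = set()
    for part in cpulist.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def placement_bit_string(cpulist: str, num_cpus: int) -> str:
    """Build a bit string of length num_cpus; position i is "1" if the
    device is local to CPU i (assumed ordering: leftmost is CPU 0)."""
    cpus = parse_cpulist(cpulist)
    if any(cpu >= num_cpus for cpu in cpus):
        raise ValueError("cpulist names a CPU outside the node")
    return "".join("1" if i in cpus else "0" for i in range(num_cpus))


# A device local to CPUs 0-3 and 8-11 on a 16-CPU node.
print(placement_bit_string("0-3,8-11", 16))  # 1111000011110000
```

Because every device on the node gets a string of the same length, two devices can be compared position by position without knowing anything about sockets, NUMA nodes, or PCIe roots.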
This bit string is calculable based on data in /proc and similar places on Linux machines, and the folks building Slurm have agreed to develop the necessary code to calculate this bit string in OSS to share with Kubernetes. This common library could be used by DRA plugin authors to publish the common placement bit string, and an alignment constraint option could be added to the ResourceClaim constraints. This would allow users to require alignment between their devices without having to understand the details of intra-node topology. It would also allow us to optimally align by default (that is, without the user asking, we would align if possible; with the constraint, we would fail scheduling if we cannot align).
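The difference between the default best-effort behavior and the explicit constraint can be sketched as follows. This Python example assumes that "aligned" means the devices' bit strings share at least one set position, which is one plausible reading and not a definition taken from Slurm or DRA; the names shared_cpus, check_alignment, and AlignmentError are hypothetical.

```python
from typing import Sequence


class AlignmentError(Exception):
    """Raised when alignment is required but cannot be satisfied."""


def shared_cpus(bit_strings: Sequence[str]) -> str:
    """Bitwise AND of equal-length bit strings, returned as a bit string."""
    if not bit_strings:
        raise ValueError("need at least one bit string")
    width = len(bit_strings[0])
    if any(len(bits) != width for bits in bit_strings):
        raise ValueError("bit strings must have equal length")
    return "".join(
        "1" if all(bits[i] == "1" for bits in bit_strings) else "0"
        for i in range(width)
    )


def check_alignment(bit_strings: Sequence[str], required: bool) -> bool:
    """Return whether the devices share at least one CPU.

    With required=False (best effort) an unaligned set is reported but
    accepted; with required=True it is rejected by raising.
    """
    aligned = "1" in shared_cpus(bit_strings)
    if required and not aligned:
        raise AlignmentError("devices share no CPUs")
    return aligned


gpu = "11110000"
nic_near = "11000000"
nic_far = "00001111"

print(shared_cpus([gpu, nic_near]))                    # 11000000
print(check_alignment([gpu, nic_far], required=False))  # False
```

In a scheduler, the best-effort path would presumably prefer candidate sets for which the check returns True and fall back to others, while the required path would treat an AlignmentError as a scheduling failure.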
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):
Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.