-
Notifications
You must be signed in to change notification settings - Fork 691
Use LinearCache to optimize StreamEndpoint discovery. #6906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
8f2cf3c
to
7bc7a34
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6906 +/- ##
==========================================
+ Coverage 80.72% 80.81% +0.09%
==========================================
Files 131 131
Lines 19868 19857 -11
==========================================
+ Hits 16039 16048 +9
+ Misses 3537 3518 -19
+ Partials 292 291 -1
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before I review, we are extremely interested in this but is there a way to have this behind a feature flag?
Coming from a bit of ignorance but could contour be updated in place and things should just work or would it require to restart all the envoy pods?
I have not worked with go-control-plane and xDS subscription versioning details, so this should be carefully reviewed. I'd appreciate extra eyes on this. If necessary, we can add a feature flag, but I'm not sure if it is needed - see below.
I don't believe Envoy pods need to be restarted. As far as I understand, Envoys are completely unaware of the algorithm the server uses; they simply return the last received version info to the server. I'm still working on fully understanding the difference between the cache implementations. I created https://github.com/tsaarni/grpc-json-sniffer to gain more insight into this issue. |
I've ran some test scenarios and documented them here https://gist.github.com/tsaarni/db319d5d9935d18f8856fcdd9b2a89ae The go-control-plane cache implementations have some details that might be interesting to study as well |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
Signed-off-by: Tero Saarni <[email protected]>
7bc7a34
to
9ab7ac8
Compare
I've updated the PR description to provide a clearer explanation of the issue. I'd really appreciate any reviews when you have time. Thanks! |
The Contour project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to the #contour channel in the Kubernetes Slack |
This change attempts to improve performance in clusters with a large number of endpoints, as discussed in #6743 (comment).
Envoy does not send a wildcard
DiscoveryRequest
(a request without a resource name) for EDS /ClusterLoadAssignment
resources. Instead, it creates a separate EDS stream for each CDS entry and requests specific resource by name. For example, with 10000 upstream clusters, each Envoy instance sends 10000DiscoveryRequests
, one per endpoint, e.g. fromechoserver-0000
toechoserver-9999
.As a result, the SotW-style update, where full set of resources is sent at every update, is not applicable. If
echoserver-0000
changes, updates should not be sent to streams watchingechoserver-0001
-echoserver-9999
. SotW update should consists of single update, forechoserver-0000
. Effectively, EDS behaves like incremental update mechanism, since each endpoint has its own stream / watch.Using
SnapshotCache
is problematic in this scenario because it broadcastsDiscoveryResponse
update to all EDS streams whenever any endpoint changes. Even if onlyechoserver-0000
is updated,SnapshotCache
will send 10000 updates, fromechoserver-0000
toechoserver-9999
to each Envoy instance. In each of these updates Contour sendsDiscoveryResponse
, followed by Envoy immediately sending newDiscoveryRequests
back to Contour to watch further updates. Since these messages are relatively heavy-weight, this creates unnecessary overhead compared to typical SotW update.This PR replaces
SnapshotCache
withLinearCache
for EDS.LinearCache
addresses the issue by tracking which stream requested which resource and using versioning to ensure that updates are sent only to streams watching the specific endpoints that changed. Whenechoserver-0000
is updated, only the EDS streams watchingechoserver-0000
will receive the update.The
LinearCache
was previously considered but not adopted due to complications outlined by @skriss in a prior PRThis PR attempts to mitigate this by generating unique version prefix at each startup.
Fixes #6743