CosmosDBChangeFeed occasionally pushes feed to multiple workers instead of distributing the feed #132
Comments
Any contributors available to look into this?
Can you clarify how long this concurrent processing continues, please? While there may be a case to answer here in terms of this representing a behavior that could be handled more correctly, I'd venture that this is a secondary issue: the bottom line is that any code sitting in a ChangeFeedProcessor needs to be able to deal with such an occurrence (at-least-once delivery) by being idempotent. E.g. this is no different from a processor losing contact with the lease store, getting usurped at the end of the lease TTL, but then regaining contact and attempting to renew the lease; it's unavoidable that two or more processors can end up in situations where there is duplicate (or even parallel) processing. The key guarantees you do have are: a) at-least-once delivery, and b) if something is repeated, it will always replay forward from that point until it's back in sync. Example ways of dealing with this might be:
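One such approach, sketched very loosely here: record which changes have already been applied, so a batch that is replayed after a lease handoff is skipped rather than processed twice. All of the names below (`IProcessedMarkerStore`, `IDownstreamWriter`, the id + `_lsn` key) are illustrative assumptions, not part of this library or of the OP's code.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;

// Hypothetical de-duplication store, keyed on something that identifies a specific change.
public interface IProcessedMarkerStore
{
    Task<bool> ExistsAsync(string key, CancellationToken ct);
    Task MarkAsync(string key, CancellationToken ct);
}

// Hypothetical downstream, side-effecting writer.
public interface IDownstreamWriter
{
    Task WriteAsync(Document doc, CancellationToken ct);
}

public class IdempotentBatchHandler
{
    private readonly IProcessedMarkerStore _markers;
    private readonly IDownstreamWriter _downstream;

    public IdempotentBatchHandler(IProcessedMarkerStore markers, IDownstreamWriter downstream)
    {
        _markers = markers;
        _downstream = downstream;
    }

    public async Task HandleAsync(IReadOnlyList<Document> docs, CancellationToken ct)
    {
        foreach (var doc in docs)
        {
            // Key the marker on id + _lsn so the same change, delivered again after a
            // lease handoff, is detected and skipped rather than re-applied.
            var key = $"{doc.Id}:{doc.GetPropertyValue<long>("_lsn")}";
            if (await _markers.ExistsAsync(key, ct)) continue;

            await _downstream.WriteAsync(doc, ct);
            // If marking fails after the write, the write repeats on replay, so the
            // downstream write itself should also tolerate duplicates (e.g. an upsert).
            await _markers.MarkAsync(key, ct);
        }
    }
}
```

An alternative with the same effect is to make the downstream write naturally idempotent (e.g. an upsert keyed on the source document id) and skip the marker store entirely.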
When we had a 3-minute activity in Cosmos DB triggering the change feed, LeaseLostExceptions were being recorded for a 35-second window.
Can you expand on this "had a 3-minute activity"? (I'm wondering how this is relevant, given that AIUI the CFP is an async thing that's going all the time; lease losses happen either because a) progress in the entire process was impeded and the lease expires, b) it's an actual transfer of ownership, or c) e.g. the process dies or loses contact.) A 35-second window can definitely happen (e.g. in the normal handoff when starting a second CFP you'll see that).
Should these "lease with token" issues happen if we have a single instance of the ChangeFeed processor running? Ignoring redeployments, I have a solution using ChangeFeed that has been running for a while; most of the time it is running a single instance, and checking my App Insights I've been getting between 1.5k and 2.5k of these lease issues a day. Let's say I have 10k records in my Cosmos DB that were processed by one host. Or am I just confused with the concepts? I'm asking this here as I couldn't find any other mention of this issue anywhere else.
(I'm only another consumer, but I'm very interested in seeing these sorts of tradeoffs documented; see #124, #125 and more.)

Wrt deploys and the hostname: can you add some info as to how you derive the hostname in your context? E.g. in my context, it's this algo. The hostname is simply an id that should uniquely identify a running instance, mainly for troubleshooting purposes so you can kill it etc. Whenever you have multiple instances running, there's a natural rebalancing that takes place a) as range leases expire and b) as extra processor instances are added; the winning lessor stamps their id onto the row. => You need to explain how you derive the value you pass in.

Which brings me to my next question: what is "this lease conflict" of which you speak? Is it purely the occurrence of that log entry? If you and/or the OP are claiming there is a bug or weird behavior, this needs to be clearly stated. If this is instead a doc request, state that. I'd dearly love to see a section in the README.md regarding the intended pattern and/or what to expect here. So, maintainers, if you're watching: consider me to be asking this question as a doc request, without any implication that I believe the behavior to be incorrect. (One might even say it can't be incorrect if it's not spec'd 😁)
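As a loose illustration of that point (this is not the commenter's actual algorithm, and the builder method names are as I understand the 2.x `Microsoft.Azure.Documents.ChangeFeedProcessor` library, so verify them against your version), one common choice is machine name plus a per-process GUID, passed to the processor builder:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.ChangeFeedProcessor;

public static class ProcessorBootstrap
{
    // The host name only needs to uniquely identify this running instance; machine name
    // plus a per-process GUID makes lease ownership in the lease collection easy to trace
    // back to a specific process.
    public static async Task<IChangeFeedProcessor> StartAsync(
        DocumentCollectionInfo feedCollectionInfo,
        DocumentCollectionInfo leaseCollectionInfo)
    {
        var hostName = $"{Environment.MachineName}-{Guid.NewGuid():N}";

        var processor = await new ChangeFeedProcessorBuilder()
            .WithHostName(hostName)
            .WithFeedCollection(feedCollectionInfo)
            .WithLeaseCollection(leaseCollectionInfo)
            .WithObserver<MyObserver>()   // MyObserver: placeholder for your IChangeFeedObserver implementation
            .BuildAsync();

        await processor.StartAsync();
        return processor;
    }
}
```

Because the GUID changes on every start, a redeployment shows up as a new host id taking over leases from the previous one, which produces a burst of the kind of lease handoff logging discussed in this thread.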
Hey @bartelink, my bad for not being clearer, thanks for calling my attention to that. I mean I'm seeing the same log messages; I haven't had any concurrency issue, as most of the time I'm running a single ChangeFeed instance. The lease issue I mention is the same log entries as in the issue description. Things for me are working fine; I would just like some information on why this happens and whether it is expected or not. Considering I am not deploying daily and I'm running one instance of the Change Feed, I'm wondering why I am still getting these messages, as there shouldn't be any other consumer trying to get the lease. I thought the update conflicts should stop a while after running a new ChangeFeed instance (after a deployment, for example).
+1, this kind of thing must be documented.
LeaseLostException means that another instance has taken that lease. It could be due to load balancing because the number of Hosts is changing. During this transition, if the original Host was processing a particular batch, then that same batch might be picked up by the Host that just got the lease after the re-balance. CFP offers "at least once" delivery because of this, and because any errors inside the processing delegate cause the same batch to be delivered again.
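To make that lifecycle concrete, here is a minimal observer sketch. The type, method and enum names are as I understand the 2.x FeedProcessing API, so treat them as assumptions to verify: `CloseAsync` is invoked with a `LeaseLost` reason when another host takes the lease, and an exception escaping `ProcessChangesAsync` prevents the checkpoint from advancing, so the same batch is delivered again.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.ChangeFeedProcessor.FeedProcessing;

public class LoggingObserver : IChangeFeedObserver
{
    public Task OpenAsync(IChangeFeedObserverContext context)
    {
        Console.WriteLine($"Acquired lease for partition {context.PartitionKeyRangeId}");
        return Task.CompletedTask;
    }

    public Task CloseAsync(IChangeFeedObserverContext context, ChangeFeedObserverCloseReason reason)
    {
        // A LeaseLost reason means another host now owns this partition's lease (normal
        // rebalancing); a batch that was mid-flight here may be delivered again to that host.
        Console.WriteLine($"Released lease for partition {context.PartitionKeyRangeId}: {reason}");
        return Task.CompletedTask;
    }

    public async Task ProcessChangesAsync(
        IChangeFeedObserverContext context,
        IReadOnlyList<Document> docs,
        CancellationToken cancellationToken)
    {
        foreach (var doc in docs)
        {
            // An exception escaping this method means the checkpoint does not advance and the
            // batch is retried: at-least-once delivery, so this handler must be idempotent.
            await HandleIdempotentlyAsync(doc, cancellationToken);
        }
    }

    // Placeholder for an idempotent handler such as the de-duplicating sketch earlier in the thread.
    private Task HandleIdempotentlyAsync(Document doc, CancellationToken ct) => Task.CompletedTask;
}
```

Such an observer would be registered via `WithObserver<LoggingObserver>()` on the builder, as in the host-name sketch above.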
We are using the ChangeFeed SDK to process data from Cosmos DB. Sometimes the same feed goes to multiple workers, resulting in work being duplicated. From our logs we have tracked this issue down to the point where a Microsoft.Azure.Documents.ChangeFeedProcessor.Exceptions.LeaseLostException occurs. Right after the exception occurs there is a race condition and multiple workers end up working on the same feed.
We tried versions 2.2.5 and 2.2.6; the issue still persists.
I have created a sample application on GitHub: https://github.com/sthayamkery/CosmosChangeFeedProcessorBug
You can see the logs recording the following:
[15:08:07 INF] Partition 0 lease update conflict. Reading the current version of lease.
[15:08:07 INF] Partition 0 update failed because the lease with token '"00000000-0000-0000-b8de-fd1b992201d4"' was updated by host 'UserDataProcessor' with token '"00000000-0000-0000-b8df-6a2fc72501d4"'. Will retry, 5 retry(s) left.