CosmosDBChangeFeed occasionally pushes feed to multiple workers instead of distributing the feed #132

Closed
sthayamkery opened this issue Jan 31, 2019 · 9 comments

@sthayamkery

We are using the ChangeFeed SDK to process data from Cosmos DB. Sometimes the same feed goes to multiple workers, resulting in duplicated work. From our logs we have tracked this down to occurrences of Microsoft.Azure.Documents.ChangeFeedProcessor.Exceptions.LeaseLostException: right after the exception there is a race condition, and multiple workers end up working on the same feed.
We tried versions 2.2.5 and 2.2.6; the issue persists in both.

I have created a sample application on GitHub: https://github.com/sthayamkery/CosmosChangeFeedProcessorBug
You can see the logs recording the following:
[15:08:07 INF] Partition 0 lease update conflict. Reading the current version of lease.
[15:08:07 INF] Partition 0 update failed because the lease with token '"00000000-0000-0000-b8de-fd1b992201d4"' was updated by host 'UserDataProcessor' with token '"00000000-0000-0000-b8df-6a2fc72501d4"'. Will retry, 5 retry(s) left.

@sthayamkery
Author

Any contributors available to look into this?

@bartelink

Can you clarify how long this concurrent processing continues please?

(While there may be a case to answer here in terms of this being behavior that could be handled more gracefully, I'd venture that it's a secondary issue: the bottom line is that any code sitting behind a ChangeFeedProcessor needs to be able to deal with such an occurrence (this is at-least-once delivery) by being idempotent.)

e.g. this is no different to a processor losing contact with the lease store, getting usurped at the end of the lease TTL, then regaining contact and attempting to renew the lease; it's unavoidable that two or more processors can end up in situations where there is duplicate (or even parallel) processing.

The key guarantees you do have are: a) at-least-once delivery, and b) if something is repeated, processing always replays forward from that point until it's back in sync.

Example ways of dealing with this might be (see the sketch after this list):

  • (for forwarding scenarios) have the bus do message deduplication
  • (for email sending) use a Reservation Pattern to manage the synchronisation, achieving something closer to at-most-once delivery semantics
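
To make the idempotency point concrete, here's a minimal sketch (mine, not the SDK's prescribed pattern) of an observer that keys its work on document id + etag, so a batch replayed after a lease loss becomes a no-op. It assumes the v2 IChangeFeedObserver interface; the in-memory set and HandleAsync are placeholders for a durable dedupe store and the real business logic:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.ChangeFeedProcessor.FeedProcessing;

public class IdempotentObserver : IChangeFeedObserver
{
    // Keyed by "id:_etag" so a replay of the same document version is skipped.
    // In-memory only for illustration; a real implementation would use a durable store
    // (or delegate deduplication to the downstream bus, as suggested above).
    private static readonly ConcurrentDictionary<string, bool> Processed =
        new ConcurrentDictionary<string, bool>();

    public Task OpenAsync(IChangeFeedObserverContext context) => Task.CompletedTask;

    public Task CloseAsync(IChangeFeedObserverContext context, ChangeFeedObserverCloseReason reason) =>
        Task.CompletedTask;

    public async Task ProcessChangesAsync(
        IChangeFeedObserverContext context, IReadOnlyList<Document> docs, CancellationToken cancellationToken)
    {
        foreach (var doc in docs)
        {
            var key = $"{doc.Id}:{doc.ETag}";
            if (!Processed.TryAdd(key, true)) continue; // this document version was already handled

            await HandleAsync(doc); // the actual business logic (placeholder)
        }
    }

    private Task HandleAsync(Document doc) => Task.CompletedTask; // placeholder
}
```

The same shape works for the email case: reserve before sending, and treat a replay that finds an existing reservation as already done.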

@sthayamkery
Author

When we had a 3-minute activity in Cosmos DB triggering a change feed, LeaseLostExceptions were recorded for a 35-second window.

@bartelink

> had a 3-minute activity in Cosmos DB triggering a change feed

Can you expand on this "3 minute activity"? (I'm wondering how it's relevant, given that AIUI the CFP is an async thing that's running all the time; lease losses happen either because a) progress in the entire process was impeded and the lease expired, b) there's an actual transfer of ownership, or c) the process dies or loses contact.)

A 35-second window can definitely happen, e.g. you'll see that in the normal handoff when starting a second CFP instance.

@rafamerlin

rafamerlin commented Mar 15, 2019

Should these "lease with token" issues happen if we have a single instance of the ChangeFeed processor running?

Ignoring redeployments: I have a solution using ChangeFeed that has been running for a while, most of the time as a single instance, and checking my App Insights I've been getting between 1.5k and 2.5k of these lease messages a day.

Let's say I have 10k records in my Cosmos DB that were processed by host abcdefg-1234-1234-1234-abcdefg, and I redeploy my app, so this first host is removed and a new one, newone-1234-1234-1234-abcdefg, is created. Does that mean that every record that subsequently changes will hit this lease conflict, because the last time it was changed it was under a different host?

Or am I just confused about the concepts? I'm asking here as I couldn't find any other mention of this issue anywhere else.

@bartelink

(I'm only another consumer, but I am very interested in seeing these sorts of tradeoffs documented: #124, #125 and more.)
Can you edit in some more info regarding "of the lease issues"? That's very vague; there are natural reasons for leases to transition, and you need to be a lot clearer for your comment to be helpful.

Wrt deploys and the hostname: can you add some info on how you derive the hostname in your context? E.g. in my context, it's this algo. The hostname is simply an id that should uniquely identify a running instance, mainly for troubleshooting purposes, so you can kill it etc. Whenever you have multiple instances running, there's a natural rebalancing that takes place a) as range leases expire and b) as extra processor instances are added; the winning lessor stamps their id onto the row. => You need to explain how you derive the value you pass to WithHostName.
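
For concreteness, roughly how that value can be derived and handed to the v2 builder (a sketch only; the DocumentCollectionInfo arguments are supplied by the caller, and IdempotentObserver is the observer sketched in an earlier comment):

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.ChangeFeedProcessor;

public static class HostSetup
{
    public static async Task<IChangeFeedProcessor> StartAsync(
        DocumentCollectionInfo feedCollection, DocumentCollectionInfo leaseCollection)
    {
        // Unique per running instance; after a redeploy the new instance simply
        // claims the leases under its own name, which is expected behavior.
        var hostName = $"{Environment.MachineName}-{Guid.NewGuid():N}";

        var processor = await new ChangeFeedProcessorBuilder()
            .WithHostName(hostName)
            .WithFeedCollection(feedCollection)   // the monitored collection
            .WithLeaseCollection(leaseCollection) // the lease collection
            .WithObserver<IdempotentObserver>()   // any IChangeFeedObserver with a parameterless constructor
            .BuildAsync();

        await processor.StartAsync();
        return processor;
    }
}
```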

Which brings me to my next question: what is "this lease conflict" of which you speak? Is it purely the occurrence of that log entry? If you and/or the OP are claiming there is a bug or weird behavior, that needs to be clearly stated. If this is instead a doc request, state that.

I'd dearly love to see a section in the README.md on the intended pattern and/or what to expect here. So, maintainers, if you're watching: consider me to be asking this as a doc request, without any implication that I believe the behavior to be incorrect. (One might even say it can't be incorrect if it's not spec'd 😁)

@rafamerlin

Hey @bartelink, my bad for not being clearer; thanks for calling my attention to that.

I mean I'm seeing the same log messages. I haven't had any concurrency issues, as most of the time I'm running a single ChangeFeed instance.

The lease issues I mention are these log entries:
Partition 0 lease update conflict. Reading the current version of lease.
and
Partition 0 update failed because the lease with token '"xxx"' was updated by host 'zzz' with token '"yyy"'. Will retry, 5 retry(s) left.

Things are working fine for me. I would just like some information on why this happens and whether it is expected. Considering I am not deploying daily and I'm running one instance of the Change Feed, I'm wondering why I am still getting these messages, as there shouldn't be any other consumer trying to take the lease. I thought the update conflicts would stop after a new ChangeFeed instance had been running for a while (after a deployment, for example).

@yahorsi

yahorsi commented Apr 12, 2019

+1, this kind of thing needs to be documented.

@ealsur
Member

ealsur commented Oct 4, 2019

LeaseLostException means that another instance has taken that lease. It can happen due to load balancing when the number of Hosts changes.

During this transition, if the original Host was processing a particular batch, that same batch might be picked up by the Host that just acquired the lease after the rebalance.

The CFP offers "at least once" delivery because of this, and because any errors inside the ProcessChangesAsync code cause the same batch to be retried.
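
For illustration only (a sketch, not official guidance), this is the shape that implies for handler code: the body of ProcessChangesAsync should let failures propagate so the batch is re-delivered, and the per-document work (HandleAsync is a placeholder) must tolerate running more than once.

```csharp
public async Task ProcessChangesAsync(
    IChangeFeedObserverContext context, IReadOnlyList<Document> docs, CancellationToken cancellationToken)
{
    foreach (var doc in docs)
    {
        // Do not catch-and-continue here: an unhandled exception makes the CFP
        // retry the whole batch, which is what preserves at-least-once delivery.
        // It also means HandleAsync must be safe to run twice for the same document.
        await HandleAsync(doc);
    }
}
```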

ealsur closed this as completed on Oct 4, 2019