Attributable failures (feature 36/37) #1044


Open: wants to merge 2 commits into master

Conversation

joostjager
Collaborator

@joostjager joostjager commented Nov 21, 2022

Failure attribution is important to properly penalize nodes after a payment failure occurs. The goal of the penalty is to give the next attempt a better chance at succeeding. In the happy failure flow, the sender is able to determine the origin of the failure and penalizes a single node or pair of nodes.

Unfortunately it is possible for nodes on the route to hide themselves. If they return random data as the failure message, the sender won't know where the failure happened.

This PR proposes a new failure message format that lets each node commit to the failure message. If one of the nodes corrupts the failure message, the sender will be able to identify that node.

For more information, see https://lists.linuxfoundation.org/pipermail/lightning-dev/2022-October/003723.html.

LND implementation: lightningnetwork/lnd#7139

LDK implementation: lightningdevkit/rust-lightning#3611

Eclair implementation: ACINQ/eclair#2519

@thomash-acinq

I've started implementing it in eclair. Do you have test vectors so we can check that we are compatible?
The design seems good to me, but as I've said previously, I think keeping hop payloads and hmacs for 8 nodes only (instead of 27) is enough for almost all use cases and would give us huge size savings.

@joostjager
Collaborator Author

joostjager commented Dec 6, 2022

I don't have test vectors yet, but I can produce them. Will add them to this PR when ready.

Capping the max hops at a lower number is fine by me, but do you have a scenario in mind where this would really make a difference? Or is it more generally that everything above 8 is wasteful?
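For a rough sense of the size savings under discussion, here is a sketch. The 5-byte payload length (a 1-byte flag plus a 4-byte hold time) and the 32-byte HMAC size are assumptions; the hop counts are just illustrative.

```go
package main

import "fmt"

// attributionDataSize estimates the attribution data size in bytes for a
// given maximum hop count: one fixed-size payload per hop plus a
// triangular number of 32-byte HMACs (n + (n-1) + ... + 1). payloadLen
// is an assumption; the actual per-hop payload size depends on the spec.
func attributionDataSize(maxHops, payloadLen int) int {
	hmacCount := maxHops * (maxHops + 1) / 2
	return maxHops*payloadLen + hmacCount*32
}

func main() {
	fmt.Println(attributionDataSize(20, 5)) // 100 + 210*32 = 6820
	fmt.Println(attributionDataSize(8, 5))  // 40 + 36*32 = 1192
}
```

The quadratic HMAC term dominates, which is why lowering the hop cap shrinks the structure so dramatically.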

@joostjager joostjager force-pushed the fat-errors branch 2 times, most recently from 4b48481 to 24b10d5 Compare December 6, 2022 16:52
@joostjager
Collaborator Author

@thomash-acinq added a happy fat error test vector.

09-features.md Outdated
@@ -41,6 +41,7 @@ The Context column decodes as follows:
| 20/21 | `option_anchor_outputs` | Anchor outputs | IN | `option_static_remotekey` | [BOLT #3](03-transactions.md) |
| 22/23 | `option_anchors_zero_fee_htlc_tx` | Anchor commitment type with zero fee HTLC transactions | IN | `option_static_remotekey` | [BOLT #3][bolt03-htlc-tx], [lightning-dev][ml-sighash-single-harmful]|
| 26/27 | `option_shutdown_anysegwit` | Future segwit versions allowed in `shutdown` | IN | | [BOLT #2][bolt02-shutdown] |
| 28/29 | `option_fat_error` | Can generate/relay fat errors in `update_fail_htlc` | IN | | [BOLT #4][bolt04-fat-errors] |
Collaborator Author


I think this big gap in the bits has emerged here because of tentative spec changes that may or may not make it. Not sure why that is necessary. I thought for unofficial extensions, the custom range is supposed to be used?

I can see that with unofficial features deployed in the wild, it is easier to keep the same bit when something becomes official. But not sure if that is worth creating the gap here? An alternative is to deploy unofficial features in the custom range first, and then later recognize both the official and unofficial bit. Slightly more code, but this feature list remains clean.

@joostjager
Collaborator Author

Added fat error signaling to the PR.

@thomash-acinq

I've spent a lot of time trying to make the test vector pass and I've finally found what was wrong:
In the spec you write that the hmac covers

  • failure_len, failuremsg, pad_len and pad.

  • The first y+1 payloads in payloads. For example, hmac_0_2 would cover all three payloads.

  • y downstream hmacs that correspond to downstream node positions relative to x. For example, hmac_0_2 would cover hmac_1_1 and hmac_2_0.

implying that we need to concatenate them in that order. But in your code you follow a different order:

// Include payloads including our own.
_, _ = hash.Write(payloads[:(NumMaxHops-position)*payloadLen])

// Include downstream hmacs.
var hmacsIdx = position + NumMaxHops
for j := 0; j < NumMaxHops-position-1; j++ {
	_, _ = hash.Write(
		hmacs[hmacsIdx*sha256.Size : (hmacsIdx+1)*sha256.Size],
	)

	hmacsIdx += NumMaxHops - j - 1
}

// Include message.
_, _ = hash.Write(message)

I think the order message + hop payloads + hmacs is more intuitive as it matches the order of the fields in the packet.

@joostjager
Collaborator Author

Oh great catch! Will produce a new vector.

@joostjager
Collaborator Author

@thomash-acinq updated vector

@joostjager
Collaborator Author

Updated LND implementation with sender-picked fat error structure parameters: lightningnetwork/lnd#7139


@GeorgeTsagk GeorgeTsagk left a comment


Great proposal! Quite an elegant idea.
There are some elusive details that make this work, which might need some more coverage.

`-` | `-` | `-` | `hmac_0'_1` | `hmac_0'_0` | `hmac_1'_0`

The former `hmac_x'_y` now becomes `hmac_x+1_y`. The left-most hmac for
each hop is discarded.


Could you elaborate a bit more on why these HMACs can be discarded / considered irrelevant?

Collaborator Author


Added a bit more explanation.

and verifies the HMAC that corresponds to the hop's position in the path, using
each hop's `um` key.

When the origin node encounters a payload that signals that it is a final

@GeorgeTsagk GeorgeTsagk Feb 24, 2025


should also include the "unhappy" path here, focusing on:

  • why origin attributes blame/error on first decoding failure (if that's still the intended way in which things work)
  • how an intermediary node that maliciously flips bits cannot shake the blame off themselves

Collaborator Author


What are origin attributes?

I think I've elaborated on this more in the latest iteration of the PR.

Comment on lines 1127 to 1144
The former `hmac_x'_y` now becomes `hmac_x+1_y`. The left-most hmac for
each hop is discarded.


Can the forwarding node always pick the correct pruning positions if it doesn't know its own position in the path?

Collaborator Author


Yes, the forwarding node can pick the correct pruning positions. The block of hmacs that the forwarding node receives contains a series of hmacs for every possible path length up to 20 hops. The forwarding node knows that it is itself part of the path, so there can never be an additional 20 hops downstream (this would bring the total to 21). This also means that the last hmac of every hmac series won't ever be useful.

Will try to add some more text to explain this.
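The pruning step described above can be sketched as follows. The nested-slice layout (one series of HMACs per hop position) is an illustrative flattening, not the wire format, and the function name is hypothetical.

```go
package main

import "fmt"

// pruneHMACSeries drops one HMAC per hop series at each backward step.
// Each hop position carries a series of HMACs, one per possible
// remaining path length. Because the forwarding node knows it occupies
// one hop itself, the entry for the now-impossible longest path length
// can never be needed by the sender and is discarded.
func pruneHMACSeries(series [][][]byte) [][][]byte {
	pruned := make([][][]byte, len(series))
	for i, s := range series {
		if len(s) > 0 {
			// Discard the left-most HMAC of each series.
			pruned[i] = s[1:]
		}
	}
	return pruned
}

func main() {
	in := [][][]byte{
		{{1}, {2}, {3}}, // series for one hop position
		{{4}, {5}},
	}
	out := pruneHMACSeries(in)
	fmt.Println(len(out[0]), len(out[1])) // each series shrinks by one
}
```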


At each step backwards, one hmac for every hop can be pruned. Rather than
holding on to 20 * 20 = 400 hmacs, pruning reduces the total space requirement
to 210 hmacs. More on pruning below.


to 210 hmacs

There was a (max_hops * (max_hops + 1)) / 2 formula in a previous diff explaining how this number was produced; it seems to have been trimmed.

Collaborator Author


Re-added explanation of 210.
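For reference, the 210 figure is the triangular sum over the shrinking series lengths: with pruning, the sender only needs HMAC series of lengths 20, 19, ..., 1, so

\[
\sum_{k=1}^{20} k = \frac{20 \cdot 21}{2} = 210
\]

instead of the unpruned \(20 \times 20 = 400\).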

channel.
The per-hop payload consists of the following fields:
* [`byte`:`payload_source`]
* [`uint32`:`hold_time_ms`]
Collaborator Author


One thing I now realize is that even if there's a node on the path that messes with the data and/or hmacs, the upstream nodes will still have valid hmacs and also valid hold_time_ms.

Information that isn't all that useful in that case though. The sender will simply proceed with penalizing the bad node, retry, and only process timing information after "proper" failures.
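The 5-byte per-hop payload from the diff above can be sketched like this. Big-endian encoding is an assumption here, matching the usual convention for multi-byte integers in the protocol; the function name is hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeHopPayload serializes the per-hop payload described above:
// a payload_source byte followed by a 4-byte hold_time_ms.
func encodeHopPayload(payloadSource byte, holdTimeMs uint32) []byte {
	buf := make([]byte, 5)
	buf[0] = payloadSource
	binary.BigEndian.PutUint32(buf[1:], holdTimeMs)
	return buf
}

func main() {
	fmt.Println(encodeHopPayload(1, 250)) // [1 0 0 0 250]
}
```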

@joostjager
Copy link
Collaborator Author

If we go for the idea described in #1044 (comment), it may be unnecessary to keep that first payload byte that is either 0 or 1 to indicate intermediate or failing node. Because the 'legacy' hmac is still present, that can also be used to identify the failing node.

@joostjager joostjager changed the title Attributable errors (feature 36/37) Attributable failures (feature 36/37) Feb 27, 2025
@joostjager
Copy link
Collaborator Author

Thanks for your review @GeorgeTsagk. Comments addressed.

@joostjager
Copy link
Collaborator Author

PR rebased and updated to reflect the new approach with a tlv extension to update_fail_htlc.

calculated. The redundant HMACs will cover portions of the zero-initialized
data.

Finally a new key is generated, using the key type `ammagext`. This key is then
Collaborator Author


Decided to derive a new key rather than continue pulling bits for chacha from the ammag key, to minimize the chance of somehow reusing the stream. Alternatively we could use a different nonce, but I believe that would be new in lightning?

Collaborator


Decided to derive a new key rather than continue pulling bits for chacha from the ammag key, to minimize the chance of somehow reusing the stream.

How does deriving a new key do a better job at avoiding key reuse than pulling from the stream? The stream is generated via ECDH between the sender and a given hop; you mix in both sources of randomness (sender + node), thereby strengthening the entropy of the stream.

On the other hand, the strength of this new key would depend only on the entropy sources of that intermediate node (I may be missing some context; this is a drive-by review from the GitHub UI diff).

Collaborator Author


No, a new key doesn't do a better job than pulling from the stream. It's just that we now have two fields to encrypt, and I was thinking of the case where a programming error initializes the stream for each field from the same key, producing the same bits.

I think it just comes down to personal preference, no strong argument against or in favor of a new key.

@joostjager joostjager force-pushed the fat-errors branch 3 times, most recently from bdf9e56 to c2459c7 Compare February 28, 2025 14:49
@joostjager
Collaborator Author

Updated test vectors

@joostjager
Collaborator Author

Added a sentence stating that nodes may report a zero HTLC hold time if they have no timing information available.

@joostjager
Collaborator Author

An open question is how the odd/even feature bits should be interpreted. What does an odd feature bit mean for attributable failures? Nodes supporting it will just always add attribution_data and don't have any requirements for their peers. Perhaps even could mean that they require attribution_data from their downstream peer?

For sending nodes, signaling even could indicate that if you want to serve this sender as a path node, you'll have to provide attribution_data?

@DerEwige

@rustyrussell asked me to post my latency data in the attributable failures PR.
I hope this is the right place; if not, please just point me in the right direction.

I've been tracking latency for all my payments for over a year now and ban channels that are slow.
This is done mainly to avoid the heightened force-close risk that comes with stuck HTLCs.

I do not keep historic data, but rather just have a moving window of the last few days.
The current percentiles look like this:

HTLC processing time per hop

successful payments
p50   0.69s
p90   2.63s
p99  10.75s

failed payments
p50   0.66s
p90   3.73s
p99  14.61s 

If you need more details or info on how the data was collected, I would be happy to provide them.

@GeorgeTsagk

For sending nodes, signaling even could indicate that if you want to serve this sender as a path node, you'll have to provide attribution_data?

Given that supporting nodes will do this by default, I think there's no extra value in senders advertising that they prefer attributable failures. I think the latest approach, where attribution_data is always set by default, is sufficient signal-wise.

Also, this feature does not introduce any extra risk to your own node, so people will use it by default when it becomes available, IMO. There is no need to create a "push" in feature adoption by having senders require it explicitly.

@joostjager
Collaborator Author

Yes, agreed that sender signaling is probably too much. I was just wondering what that even feature bit could mean and whether we want to define it at all.


Each HMAC covers the following data:

* The `reason` field.


Judging by the test vectors, you're not using the reason field but the "return packet", which is the same data before being XORed with the random stream.
I actually think it would make more sense to use the reason field (the serialized packet xored with the pseudo random stream) as it would make computing the attribution data completely independent:

  • First you build the "return packet" and xor it to obtain the reason field
  • And then, if you have "option_attributable_failure" enabled, you compute the attribution data using the already built reason field.

Collaborator Author


That is indeed correct: I use the "unencrypted" reason field as input for the attribution data. That's also how it was done in the 2023 version of this.

We could xor first, but I am not sure what advantage that would give? You mention attribution data being completely independent, but another perspective is that it is actually more dependent, because it is now also dependent on the cypher stream. To me, processing the failure entirely and then encrypting all of it feels more logical, but there is probably not much difference between the two.
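The encrypt-after-attribution order being kept here can be sketched as follows. Function names are hypothetical, the attribution computation is a stand-in placeholder, and the short repeating streams are illustrative only (real stream generation from the ammag and proposed ammagext keys is elided).

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// obfuscate XORs data with a cipher stream (cycled here for brevity).
func obfuscate(data, stream []byte) []byte {
	out := make([]byte, len(data))
	for i := range data {
		out[i] = data[i] ^ stream[i%len(stream)]
	}
	return out
}

// computeAttributionData is a placeholder standing in for building the
// payloads and HMACs over the plaintext return packet.
func computeAttributionData(returnPacket []byte) []byte {
	sum := sha256.Sum256(returnPacket)
	return sum[:]
}

// processFailure shows the ordering: attribution data is computed over
// the plaintext return packet first, and only then are both fields
// encrypted independently with their own streams.
func processFailure(returnPacket, ammagStream, extStream []byte) (reason, attribution []byte) {
	attribution = computeAttributionData(returnPacket) // over plaintext
	reason = obfuscate(returnPacket, ammagStream)      // then encrypt
	attribution = obfuscate(attribution, extStream)
	return reason, attribution
}

func main() {
	r, a := processFailure([]byte("packet"), []byte{0xaa}, []byte{0x55})
	fmt.Println(len(r), len(a))
}
```

The alternative discussed above would swap the first two steps, computing attribution data over the already-encrypted reason field instead.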

Collaborator Author


Pushed a commit that fixes the incorrect references to reason. Thanks for pointing that out!

I've kept the protocol to encrypt-after-attribution for now, but open to discussion.
