Skip to content

Impl intoBitset for IndexedDISI and Docvalues #14529

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Apr 21, 2025

Implement intoBitset for IndexedDISI and Docvalues.

intoBitset of Docvalues has already been called in competitive iterators, and can also be used to speed up soft delete operations.

Logic of this impl is different from default impl when calling intoBitset after advanceExact return false, where default impl starts from a not-existing doc but this impl starts from the next doc. I do not know which one is intended, maybe this should just be undefined.

relates: #14521

@jpountz
Copy link
Contributor

jpountz commented Apr 22, 2025

To keep things as simple as possible, I'm wondering about:

  • Using a very simple/robust impl in the SPARSE case that moves to the next doc in a loop until either the end of block or upTo, since it should never be a bottleneck (especially as DenseConjunctionBulkScorer is only used on dense iterators)?
  • Refactoring DENSE to load doc IDs into an in-memory bit set all the time, so that intoBitSet could directly reuse FixedBitSet.orRange.

@gf2121
Copy link
Contributor Author

gf2121 commented Apr 22, 2025

Thanks for feedback,

Using a very simple/robust impl in the SPARSE case that moves to the next doc in a loop until either the end of block or upTo, since it should never be a bottleneck (especially as DenseConjunctionBulkScorer is only used on dense iterators)?

I won't assume intoBitset will only be called by DenseConjunctionBulkScorer. I was also thinking about using it for soft delete (#14521) or docvalues' merge. IndexedDISI#writeBitset will always load the whole block of docs into a bitset.

But I'm not sure how much performance this approach gains, i agree to keep it simple as a first step.

Refactoring DENSE to load doc IDs into an in-memory bit set all the time, so that intoBitSet could directly reuse FixedBitSet.orRange.

I'm a bit confused on this, current patch has already been loading docs into a in-memory bitset and reuse orRange, could you be more specific?

@jpountz
Copy link
Contributor

jpountz commented Apr 22, 2025

I'm a bit confused on this, current patch has already been loading docs into a in-memory bitset and reuse orRange, could you be more specific?

I was wondering if we should go one step further and load the bitset into heap eagerly just after reading the block header, and then using it for nextDoc/advance too, instead of doing it lazily and just for intoBitSet.

@gf2121
Copy link
Contributor Author

gf2121 commented Apr 23, 2025

I'm not sure, would this cause heavy overreading in scenarios where only advance/advanceExact is required a few times in this block? E.g. we saw aggregation on hundreds of metrics with small total hits, which could be the case harmed by the overreading.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants