Skip to content

feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests #1154

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

NehanPathan
Copy link
Contributor


🎯 Objective:

This pull request (PR) optimizes the SmartCn dictionary loading process and introduces unit tests to ensure correctness and maintainability.


🔥 Key Changes:

1. Dictionary Optimization:

  • Replaced ByteBuffer with BinaryReader.ReadInt32() for faster and more efficient data reading.
  • Implemented ReadOnlySpan<char> to minimize memory usage and improve overall performance.

2. Comprehensive Unit Tests Added:

  • Test File: DictionaryTests.cs

    • Contains tests for loading dictionaries and verifying dictionary operations.
  • BigramDictionary Tests:

    • GetInstance() method to ensure correct singleton instantiation.
    • LoadFromFile() method to verify successful loading of the dictionary from bigramDict.dct.
    • GetFrequency() method to test frequency retrieval of valid and non-existent entries.
  • WordDictionary Tests:

    • GetInstance() to confirm proper instantiation.
    • Future tests can be added if LoadMainDataFromFile() becomes accessible (Currently it is private method).

3. Resource Files Added:

  • Location: Lucene.Net.Tests.Analysis.SmartCn.Resources
    • bigramDict.dct
    • coreDict.dct

4. Embedded Resource Loading:

  • Embedded both .dct files as resources in the test assembly to eliminate external dependencies.
  • Created a utility in LuceneTestCase to extract these resources as temporary files during tests.

🧪 Testing Details:

📂 Test Scenarios:

  • Validated successful loading of both bigramDict.dct and coreDict.dct from embedded resources.
  • Checked frequency retrieval for valid entries (hello, world) and ensured non-existent entries return 0.
  • Verified that the GetInstance() method returns a non-null singleton instance.

Assertions Included:

  • Frequency correctness for known entries.
  • Proper dictionary instantiation.
  • No regression in dictionary functionality.

🚀 Why These Changes?

💡 Performance Improvements:

  • Faster dictionary loading with reduced memory overhead.

💡 Increased Test Coverage:

  • Ensures that dictionary operations work correctly and efficiently.

💡 Simplified Testing Workflow:

  • Embedded resource handling eliminates file path dependencies.

📝 Future Considerations:

🔍 Testing WordDictionary:

  • Currently limited to GetInstance() due to private access of LoadMainDataFromFile().
  • Additional tests can be added when the method’s visibility is updated.

🚀 Performance Enhancements:

  • Future work may include further performance optimization of dictionary lookups and hash collision handling.

📂 Issue Reference:

Fixes #1153


🔎 Checklist:


📄 How to Run Tests:

  1. Build the solution using dotnet build.
  2. Run tests using dotnet test to verify that all dictionary operations work correctly.

- Replaced ByteBuffer with BinaryReader for efficiency.
- Used ReadOnlySpan<char> in BigramDictionary.
- Added tests for dictionary loading from embedded resources.
- Embedded bigramDict.dct and coreDict.dct.
@NehanPathan NehanPathan force-pushed the feat/optimize-smartcn-dictionaries branch from cdcc306 to 12223a4 Compare April 2, 2025 06:01
@NehanPathan
Copy link
Contributor Author

Hey Shad [@NightOwl888],

I recently updated my commit because I forgot to add comments in my code earlier.

  • I also noticed that the LoadMainDataFromFile() method remained internal instead of private—this was unintentional, as I had temporarily changed it for experimental testing but forgot to revert it back.
  • In yesterday’s commit, I reverted it back to private and added a comment explaining its purpose to avoid confusion in the future.
  • The force push was only to include this minor correction, and no other changes were made to the PR.

I’d appreciate it if you could review this PR when you get a chance. Let me know if any further modifications are needed!

Also, I’m planning to draft my Google Summer of Code (GSoC) proposal for Lucene.NET. If any maintainers are available, I’d love to share it for feedback. Please let me know if that would be possible.

Looking forward to your response. Thanks! 😊🚀

@paulirwin paulirwin self-assigned this Apr 8, 2025
@paulirwin paulirwin self-requested a review April 8, 2025 02:01
@paulirwin paulirwin added the notes:performance-improvement Contains a performance improvement label Apr 8, 2025
Copy link
Contributor

@paulirwin paulirwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR! I think most of the core logic looks correct, but there are some changes needed to adopt our practices with how we maintain existing code and match the upstream Lucene code. In particular, try to avoid adding unnecessary comments that do not exist upstream, unless it's to point out something lucenenet-specific that deviates from upstream, or to clarify unclear logic. Avoid adding comments to every line of code, as the code should generally be self-explanatory. Also, please make sure to not remove comments that existed previously in Lucene.NET that still apply, or comments that exist upstream.

These changes should all be pretty small and quick to resolve, but my apologies for the quantity of them. I just wanted to help clarify where comments should be removed, added, or restored.

@NehanPathan NehanPathan force-pushed the feat/optimize-smartcn-dictionaries branch from 0334ead to 76d55f6 Compare April 8, 2025 09:08
@NehanPathan NehanPathan force-pushed the feat/optimize-smartcn-dictionaries branch from 76d55f6 to fa8fc77 Compare April 8, 2025 09:17
@NehanPathan
Copy link
Contributor Author

NehanPathan commented Apr 8, 2025


[@paulirwin]
Hi Paul, thanks for the detailed feedback!

All suggested changes implemented:

  • Updated code comments to align with upstream Lucene and Lucene.NET standards.
  • Restored previously removed comments where appropriate, and added // LUCENENET tags for custom logic or deviations.
  • Lowercased bigramdict.dct filename to match upstream expectations and ensure compatibility on case-sensitive file systems.
  • Added [LuceneNetSpecific] attribute to custom test classes.

🛠️ About the try-catch (EndOfStreamException) block:
We retained a minimal try-catch only around the ReadInt32() line within the character loop to handle EndOfStreamException. This exception arises when the file ends naturally (not due to corruption), which is expected behavior for our test data.
Without this handling, DictionaryTests fail because the test file contains fewer entries than the full GB2312 character set (6768). This pattern reflects how earlier Lucene.NET logic handled the end of file — exiting cleanly once the data is exhausted.

✅ All tests are passing now.


🔁 Force-pushed twice for clarity:

  1. First push: Removed the inline comment // Reached end of file — assume remaining chars are missing above the break; to reduce verbosity.
  2. Second push: Re-added a shorter version // Reached end of file to retain context without clutter.

This keeps the codebase clean, but still understandable for future maintainers.


📢 GSoC Update:
I’ve officially submitted my Google Summer of Code 2025 proposal for Lucene.NET. Tomorrow is the final day for mentor feedback — if you get a moment, I’d truly appreciate any suggestions or thoughts you might have!

📄 Proposal (Google Docs, comments enabled):
👉 [Google Summer of Code 2025 Proposal](https://docs.google.com/document/d/1uWD45L8iNZ1A102Gxbf23ZABdGjHmh5Xm6SAZyWQHO0/edit?usp=sharing)

Thanks again for your time and support! 🙏


@paulirwin paulirwin requested a review from NightOwl888 April 15, 2025 13:13
Copy link
Contributor

@paulirwin paulirwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the feedback! This looks good to me, but I know @NightOwl888 wanted to review as well.

Copy link
Contributor

@NightOwl888 NightOwl888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. My apologies for the delay.

Given that this is a port from Java, there are a few practices to keep in mind when making changes such as these. We won't be porting from scratch again. We will be upgrading this code multiple times to keep up with the upstream Java code, so it is important that we follow similar practices as were done in Java to make this upgrade process as painless as possible. So, in short, "maintainable" for us means something very different than most .NET projects. We want to stay closely aligned with the upstream code rather than moving things around to make them more "readable", which would hinder the upgrade process.

  1. Variable declarations should be left the same as in Java unless we have a specific reason to change them.
  2. Loop style should match the upstream code (even if it looks strange), so we don't have to re-analyze the business logic every time we apply future changes from Lucene.
  3. In general, file formats should be portable between .NET and Java. There are some exceptions where in Java they were serializing objects using a proprietary Java format where we have deviated from this (including the bigramdict.mem and coredict.mem files in this project), but these files are not serviceable by users, anyway.
  4. Temp files should always be created using LuceneTestCase.CreateTempFile() or (as in this case) grouped together using LuceneTestCase.CreateTempDir() to ensure that they are not left on disk when the tests are finished running.
  5. New files that do not exist upstream should all go into a Support folder, but Support should generally not be part of the namespace. Comparing directories on disk is the simplest way to determine if we have a matching set of files as upstream, but these extra files should be physically separated.

I have left several comments inline with more detail, but do note that many of them are very repetitive.

@NehanPathan
Copy link
Contributor Author

[@NightOwl888 ]

Hey Shad! 😊

Just wanted to share that I’ve pushed the final changes based on your suggestions — thank you so much for the clear and helpful feedback throughout. Here's a quick overview of what’s been improved:

  • Refactored AbstractDictionary, WordDictionary, and BigramDictionary to enhance code structure, clarity, and maintainability, following your suggestions for the upstream.
  • Cleaned up and finalized TestBuildDictionary.cs to validate both dictionary types correctly
  • Added the zipped file to keep things tidy and lightweight

Thanks again for all your support — your feedback really helped me sharpen the code and improve my understanding! 🙌
If there's anything else to update or improve, let me know....


string tmpword;

// LUCENENET: Removed buffer and intBuffer arrays since BinaryReader handles reading values directly in a more type-safe and readable way.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this comment to say intBuffer array, since the buffer variable has been restored.


length = buffer[1];
if (length > 0)
if (length > 0 && length <= MAX_VALID_LENGTH && dctFile.Position + length <= dctFile.Length)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is MAX_VALID_LENGTH value of 1000 based on? I don't see it upstream and it isn't clear why there is a max limitation of 1000 here.

Ideally we would leave the same condition (length > 0) here as it was upstream so we get the same failure conditions, but if there is some reasoning behind this change that isn't commented, please explain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


"Hi, regarding the maxLength check:

  • The maxLength was originally used to restrict the length of words read from the dictionary file, likely to avoid reading overly large or corrupted entries. However, this constraint wasn’t necessary in the current implementation, as there was no upstream requirement for it and our test cases also pass smoothly.
  • As such, we’ve removed the maxLength check for now. If it becomes necessary in the future (for example, to limit word sizes or handle specific use cases), we can easily reintroduce it.


Span<int> buffer = stackalloc int[3];

// LUCENENET: Removed buffer and intBuffer arrays since BinaryReader handles reading values directly in a more type-safe and readable way.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this comment to say intBuffer array, since the buffer variable has been restored.

int[]
buffer = new int[3];
byte[] intBuffer = new byte[4];
string tmpword;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please declare tmpword after buffer below, not inline.


// Delete the original .dict file
string dictFile = System.IO.Path.Combine(tempDir.FullName, "bigramdict.dct");
if (File.Exists(dictFile))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than putting an if block here, make this an assert. We want the test to fail if the file isn't in the temp directory.


// Delete the original .dict file
string dictFile = System.IO.Path.Combine(tempDir.FullName, "coredict.dct");
if (File.Exists(dictFile))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than putting an if block here, make this an assert. We want the test to fail if the file isn't in the temp directory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[@NightOwl888 ]

"Hi Shad,

I’ve made the changes you requested and added some clarifications:

  • The condition if (length > 0 && dctFile.Position + length <= dctFile.Length) has been kept as is, as it ensures that the length does not cause any out-of-bounds reads. The check for dctFile.Position + length <= dctFile.Length is important because it verifies that we are not attempting to read beyond the end of the file. While we could remove it, this check is there to ensure the stability of the read operation, and I recommend keeping it unless there is a specific reason to remove it.

  • The MAX_VALID_LENGTH limitation (set to 1000) was introduced to restrict excessively large reads, which could potentially cause issues with memory consumption. However, I understand this value isn't part of the upstream logic. If this limitation is unnecessary for now, it can be removed. That said, if we encounter scenarios in the future where controlling the maximum length becomes important, we can reintroduce it at that point.

  • Updated the comments to correctly reference the intBuffer array, as it was restored in the code.

  • Moved the tmpword declaration below the buffer as per your suggestion.

  • Replaced the if block for checking the existence of the .dict file with an assert to ensure the test fails if the file is missing in the temp directory.

Let me know if you’d like to make any further changes or if you have any additional feedback!


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
notes:performance-improvement Contains a performance improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Eliminate usage of ByteBuffer in SmartCn
3 participants