-
Notifications
You must be signed in to change notification settings - Fork 647
feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests #1154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
feat: Optimize SmartCn Dictionaries and Add Dictionary Loading Tests #1154
Conversation
- Replaced ByteBuffer with BinaryReader for efficiency. - Used ReadOnlySpan<char> in BigramDictionary. - Added tests for dictionary loading from embedded resources. - Embedded bigramDict.dct and coreDict.dct.
cdcc306
to
12223a4
Compare
Hey Shad [@NightOwl888], I recently updated my commit because I forgot to add comments in my code earlier.
I’d appreciate it if you could review this PR when you get a chance. Let me know if any further modifications are needed! Also, I’m planning to draft my Google Summer of Code (GSoC) proposal for Lucene.NET. If any maintainers are available, I’d love to share it for feedback. Please let me know if that would be possible. Looking forward to your response. Thanks! 😊🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR! I think most of the core logic looks correct, but there are some changes needed to adopt our practices with how we maintain existing code and match the upstream Lucene code. In particular, try to avoid adding unnecessary comments that do not exist upstream, unless it's to point out something lucenenet-specific that deviates from upstream, or to clarify unclear logic. Avoid adding comments to every line of code, as the code should generally be self-explanatory. Also, please make sure to not remove comments that existed previously in Lucene.NET that still apply, or comments that exist upstream.
These changes should all be pretty small and quick to resolve, but my apologies for the quantity of them. I just wanted to help clarify where comments should be removed, added, or restored.
0334ead
to
76d55f6
Compare
76d55f6
to
fa8fc77
Compare
[@paulirwin] ✅ All suggested changes implemented:
🛠️ About the ✅ All tests are passing now. 🔁 Force-pushed twice for clarity:
This keeps the codebase clean, but still understandable for future maintainers. 📢 GSoC Update: 📄 Proposal (Google Docs, comments enabled): Thanks again for your time and support! 🙏 |
src/Lucene.Net.Tests.Analysis.SmartCn/Lucene.Net.Tests.Analysis.SmartCn.csproj
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the feedback! This looks good to me, but I know @NightOwl888 wanted to review as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR. My apologies for the delay.
Given that this is a port from Java, there are a few practices to keep in mind when making changes such as these. We won't be porting from scratch again. We will be upgrading this code multiple times to keep up with the upstream Java code, so it is important that we follow similar practices as were done in Java to make this upgrade process as painless as possible. So, in short, "maintainable" for us means something very different than most .NET projects. We want to stay closely aligned with the upstream code rather than moving things around to make them more "readable", which would hinder the upgrade process.
- Variable declarations should be left the same as in Java unless we have a specific reason to change them.
- Loop style should match the upstream code (even if it looks strange), so we don't have to re-analyze the business logic every time we apply future changes from Lucene.
- In general, file formats should be portable between .NET and Java. There are some exceptions where in Java they were serializing objects using a proprietary Java format where we have deviated from this (including the
bigramdict.mem
andcoredict.mem
files in this project), but these files are not serviceable by users, anyway. - Temp files should always be created using
LuceneTestCase.CreateTempFile()
or (as in this case) grouped together usingLuceneTestCase.CreateTempDir()
to ensure that they are not left on disk when the tests are finished running. - New files that do not exist upstream should all go into a
Support
folder, butSupport
should generally not be part of the namespace. Comparing directories on disk is the simplest way to determine if we have a matching set of files as upstream, but these extra files should be physically separated.
I have left several comments inline with more detail, but do note that many of them are very repetitive.
…igram and WordDictionary
[@NightOwl888 ]Hey Shad! 😊 Just wanted to share that I’ve pushed the final changes based on your suggestions — thank you so much for the clear and helpful feedback throughout. Here's a quick overview of what’s been improved:
Thanks again for all your support — your feedback really helped me sharpen the code and improve my understanding! 🙌 |
string tmpword; | ||
|
||
// LUCENENET: Removed buffer and intBuffer arrays since BinaryReader handles reading values directly in a more type-safe and readable way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update this comment to say intBuffer array
, since the buffer
variable has been restored.
|
||
length = buffer[1]; | ||
if (length > 0) | ||
if (length > 0 && length <= MAX_VALID_LENGTH && dctFile.Position + length <= dctFile.Length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is MAX_VALID_LENGTH value of 1000 based on? I don't see it upstream and it isn't clear why there is a max limitation of 1000 here.
Ideally we would leave the same condition (length > 0)
here as it was upstream so we get the same failure conditions, but if there is some reasoning behind this change that isn't commented, please explain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Hi, regarding the maxLength
check:
- The
maxLength
was originally used to restrict the length of words read from the dictionary file, likely to avoid reading overly large or corrupted entries. However, this constraint wasn’t necessary in the current implementation, as there was no upstream requirement for it and our test cases also pass smoothly. - As such, we’ve removed the
maxLength
check for now. If it becomes necessary in the future (for example, to limit word sizes or handle specific use cases), we can easily reintroduce it.
|
||
Span<int> buffer = stackalloc int[3]; | ||
|
||
// LUCENENET: Removed buffer and intBuffer arrays since BinaryReader handles reading values directly in a more type-safe and readable way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update this comment to say intBuffer array
, since the buffer
variable has been restored.
int[] | ||
buffer = new int[3]; | ||
byte[] intBuffer = new byte[4]; | ||
string tmpword; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please declare tmpword
after buffer
below, not inline.
|
||
// Delete the original .dict file | ||
string dictFile = System.IO.Path.Combine(tempDir.FullName, "bigramdict.dct"); | ||
if (File.Exists(dictFile)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than putting an if block here, make this an assert. We want the test to fail if the file isn't in the temp directory.
|
||
// Delete the original .dict file | ||
string dictFile = System.IO.Path.Combine(tempDir.FullName, "coredict.dct"); | ||
if (File.Exists(dictFile)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than putting an if block here, make this an assert. We want the test to fail if the file isn't in the temp directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[@NightOwl888 ]
"Hi Shad,
I’ve made the changes you requested and added some clarifications:
-
The condition
if (length > 0 && dctFile.Position + length <= dctFile.Length)
has been kept as is, as it ensures that the length does not cause any out-of-bounds reads. The check fordctFile.Position + length <= dctFile.Length
is important because it verifies that we are not attempting to read beyond the end of the file. While we could remove it, this check is there to ensure the stability of the read operation, and I recommend keeping it unless there is a specific reason to remove it. -
The
MAX_VALID_LENGTH
limitation (set to 1000) was introduced to restrict excessively large reads, which could potentially cause issues with memory consumption. However, I understand this value isn't part of the upstream logic. If this limitation is unnecessary for now, it can be removed. That said, if we encounter scenarios in the future where controlling the maximum length becomes important, we can reintroduce it at that point. -
Updated the comments to correctly reference the
intBuffer
array, as it was restored in the code. -
Moved the
tmpword
declaration below the buffer as per your suggestion. -
Replaced the
if
block for checking the existence of the.dict
file with an assert to ensure the test fails if the file is missing in the temp directory.
Let me know if you’d like to make any further changes or if you have any additional feedback!
…fine maxlength usage
🎯 Objective:
This pull request (PR) optimizes the SmartCn dictionary loading process and introduces unit tests to ensure correctness and maintainability.
🔥 Key Changes:
✅ 1. Dictionary Optimization:
ByteBuffer
withBinaryReader.ReadInt32()
for faster and more efficient data reading.ReadOnlySpan<char>
to minimize memory usage and improve overall performance.✅ 2. Comprehensive Unit Tests Added:
Test File:
DictionaryTests.cs
BigramDictionary Tests:
GetInstance()
method to ensure correct singleton instantiation.LoadFromFile()
method to verify successful loading of the dictionary frombigramDict.dct
.GetFrequency()
method to test frequency retrieval of valid and non-existent entries.WordDictionary Tests:
GetInstance()
to confirm proper instantiation.LoadMainDataFromFile()
becomes accessible (Currently it is private method).✅ 3. Resource Files Added:
Lucene.Net.Tests.Analysis.SmartCn.Resources
bigramDict.dct
coreDict.dct
✅ 4. Embedded Resource Loading:
.dct
files as resources in the test assembly to eliminate external dependencies.LuceneTestCase
to extract these resources as temporary files during tests.🧪 Testing Details:
📂 Test Scenarios:
bigramDict.dct
andcoreDict.dct
from embedded resources.hello
,world
) and ensured non-existent entries return0
.GetInstance()
method returns a non-null singleton instance.✅ Assertions Included:
🚀 Why These Changes?
💡 Performance Improvements:
💡 Increased Test Coverage:
💡 Simplified Testing Workflow:
📝 Future Considerations:
🔍 Testing
WordDictionary
:GetInstance()
due to private access ofLoadMainDataFromFile()
.🚀 Performance Enhancements:
📂 Issue Reference:
Fixes #1153
🔎 Checklist:
📄 How to Run Tests:
dotnet build
.dotnet test
to verify that all dictionary operations work correctly.