Skip to content

Conversation

@F0rJay
Copy link
Contributor

@F0rJay F0rJay commented Dec 25, 2025

Description

This PR improves the token counting mock in the test_token_predictor test by replacing the TODO comment with a proper mock implementation. The test now uses a predictable mock tokenizer function instead of relying on the real tokenizer, making the test more reliable and consistent.

Motivation

The previous test had a TODO comment indicating that token counting should be mocked more carefully. The test was relying on the real tokenizer, which could lead to inconsistent test results and dependencies on external tokenizer behavior. This change improves test reliability by providing a deterministic mock tokenizer.

Changes

  • Added mock_tokenizer_fn function that splits text by spaces and newlines for predictable token counting
  • Patched get_tokenizer using @patch decorator to return the mock function
  • Removed TODO comment and added explanatory comments
  • Ensures consistent and predictable token counting behavior in tests

Fixes # (issue) - N/A (test improvement, no issue filed)

New Package?

Did I fill in the tool.llamahub section in the pyproject.toml and provide a detailed README.md for my new integration or package?

  • Yes
  • No (This is a test improvement, not a new package)

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No (Test-only change in llama-index-core, no version bump needed)

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue) - Test improvement/fix
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Your pull-request will likely not be merged unless it is covered by some form of impactful unit testing.

  • I added new unit tests to cover this change
  • I believe this change is already covered by existing unit tests

Note: This change improves an existing test (test_token_predictor) by adding proper mocking. The test itself validates that the token predictor works correctly with the mocked tokenizer. The improvement ensures:

  • Token counting is now predictable and consistent
  • Tests no longer depend on real tokenizer behavior
  • The mock function properly handles text splitting by spaces and newlines

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (N/A - test-only change)
  • I have added Google Colab support for the newly added notebooks (N/A - no notebooks)
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works (Improved existing test)
  • New and existing unit tests pass locally with my changes (Test improvement, should pass)
  • I ran uv run make format; uv run make lint to appease the lint gods (Should be run before final submission)

Additional Notes

  • The mock tokenizer function (mock_tokenizer_fn) splits text by spaces and newlines, providing a simple but effective tokenization strategy for testing
  • The patch is applied at the get_tokenizer level, ensuring all code paths that use tokenization will use the mock
  • This change improves test reliability without changing any production code
  • The test continues to validate the same functionality, but now with more predictable behavior

- Add installation and prerequisites section
- Add basic usage examples (single and batch embeddings)
- Add integration examples with VectorStoreIndex and custom LLMs
- Document all configuration options and parameters
- Add examples for instruction functionality
- Add async usage examples
- Add remote server configuration guide
- Add troubleshooting section
- Include available models list

This improves the documentation from a single-line title to a complete
guide that helps users understand how to use the Ollama embeddings
integration with LlamaIndex.
- Add mock_tokenizer_fn function for predictable token counting
- Patch get_tokenizer to use mock function in tests
- Remove TODO comment and add explanatory comments
- Ensures consistent and predictable token counting behavior in tests
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Dec 25, 2025
# TODO: mock token counting a bit more carefully
# Mock token counting by patching get_tokenizer to return our mock function.
# This ensures consistent and predictable token counting behavior in tests,
# replacing the TODO: mock token counting a bit more carefully
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh? Thats a weird comment to leave

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants