Add Gemini Vision interactive CLI tool for image analysis #1105

Open
wants to merge 10 commits into
base: main
from

Conversation

emredeveloper

  • Create interactive CLI tool using smolagents and Gemini Vision API
  • Add image analysis, code extraction, and comparison features
  • Include documentation and requirements
  • Implement user-friendly command interface
  • Support code detection and automatic file saving

emre added 3 commits March 30, 2025 04:54
- Create interactive CLI tool using smolagents and Gemini Vision API
- Add image analysis, code extraction, and comparison features
- Include documentation and requirements
- Implement user-friendly command interface
- Support code detection and automatic file saving
…lity

- Implement focused CodeAgent with strict system instructions
- Add screenshot capture and analysis capabilities
- Improve tool interaction with more deterministic behaviors
- Fix issue where tools would perform unintended additional actions
- Add auto-correction for common command syntax
- Update documentation with improved usage examples
- Support direct function calling in CodeAgent mode
analyze_screenshot
]

# Özel bir sistem prompt'u tanımla  (Turkish: "Define a custom system prompt")
Collaborator

Some Turkish comments here and there: please translate them to English!

Author

I did

This commit translates all Turkish comments in the gemini_vision_agent.py file to English to maintain consistent documentation and improve code readability for international developers.

Changes include:
- Translating temperature reduction comment
- Translating system prompt definition comment
- Translating max_steps adjustment comment
- Translating base_tools configuration comment
- Translating system_prompt parameter comment

The code functionality remains unchanged; this is purely a documentation improvement.
verbosity_level=LogLevel.INFO,
max_steps=7, # Reduce processing steps
add_base_tools=False, # Disable basic tools to focus on vision capabilities
system_prompt=system_prompt # Add custom system instructions
Collaborator

Did you test the agent with latest version?

Author

I missed running some of the tests; I'll fix that and get back to you with feedback.

Author

make quality
ruff check examples src tests utils
All checks passed!
ruff format --check examples src tests utils
64 files already formatted
python utils/check_tests_in_ci.py
✅ All good!

emre and others added 3 commits March 31, 2025 13:56
gemini_vision_agent.py code is not the source of these issues.
This commit fixes formatting issues in the Gemini Vision Agent example to follow
the smolagents code style guidelines. The following changes were made:

- Fixed whitespace in blank lines within docstrings
- Applied proper quotation style (double quotes instead of single quotes)
- Adjusted spacing around operators and commas
- Improved indentation consistency
- Added trailing comma in multi-line collections
- Fixed line breaks according to ruff formatting rules

All quality checks are now passing: ruff check and ruff format.
@aymeric-roucher
Collaborator

@emredeveloper please test your PR with the latest version of smolagents. In its current state it's clear you didn't; I'll let you try and find out why 😉

emre added 3 commits March 31, 2025 15:48
- Modified CodeAgent initialization to be compatible with smolagents 1.13.0.dev0
- Removed system_prompt parameter and used description instead
- Fixed test suite by properly mocking dependencies
- Ensured display_image test passes by using effective mocking
- Fixed test_create_smolagent to work with the current codebase
- Optimized imports and error handling
- Updated requirements.txt to use smolagents version 1.12.0
@emredeveloper
Author

@aymeric-roucher I think I've understood and solved it, hopefully, I'm not wrong, haha

@aymeric-roucher
Collaborator

aymeric-roucher commented Mar 31, 2025

@emredeveloper that's the direction! The system_prompt argument is deprecated. But this is good news, because you don't need to change the system prompt at all! Also, since your agent is not a managed agent, it does not need a description.

@aymeric-roucher
Collaborator

aymeric-roucher commented Mar 31, 2025

Also, I had only done superficial checks so far. Looking deeper: we support using a VLM as the main model for a CodeAgent, as shown here, so using tools for this adds setup for no benefit.

We prefer to highlight efficient, short setups in examples. If you want to add one, it would be better to show an agent natively using Gemini to analyse images: your CodeAgent's initialization should take a Gemini VLM directly as its model argument, with no dedicated tools for image analysis. Just make a tool that loads an image from a folder and adds it to memory in observations_images.
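
For readers following along, here is a minimal sketch of what such a setup could look like. It is not code from this PR. It takes a slightly simpler route than the one suggested above: instead of a tool that writes to observations_images, it passes the loaded images straight to the agent's run call, which assumes run() accepts an `images` list of PIL images. The helper names (`find_images`, `build_agent`, `analyze_folder`), the extension set, and the model id `gemini/gemini-2.0-flash` are all illustrative assumptions, and Pillow is assumed to be installed.

```python
from pathlib import Path

# File extensions treated as images (an assumption for this sketch).
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def find_images(folder):
    """Return the image files directly inside `folder`, sorted by path."""
    return sorted(
        path
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    )


def build_agent(api_key, model_id="gemini/gemini-2.0-flash"):
    """Create a CodeAgent whose main model is a Gemini VLM, with no vision tools."""
    # Imported lazily so find_images() can be used without smolagents installed.
    from smolagents import CodeAgent, LiteLLMModel

    # model_id is an assumed LiteLLM-style identifier; adjust it to your setup.
    model = LiteLLMModel(model_id=model_id, api_key=api_key)
    # No system_prompt and no description, following the review comments.
    return CodeAgent(tools=[], model=model, max_steps=7)


def analyze_folder(folder, question, api_key):
    """Ask the agent one question about every image found in `folder`."""
    from PIL import Image  # Pillow, assumed to be installed

    images = [Image.open(path) for path in find_images(folder)]
    agent = build_agent(api_key)
    # Assumes agent.run() accepts an `images` list of PIL images.
    return agent.run(question, images=images)
```

A call such as `analyze_folder("screenshots", "Describe each image.", api_key)` would then hand the images to the VLM directly, with no dedicated analysis tool in between.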

@emredeveloper
Copy link
Author

@aymeric-roucher okay...
