A Vision-Language Model with 1M Context and Grounding
GLM-4.5V is zai-org's latest vision-language model, featuring a 1M-token context window, strong multimodal understanding, and visual grounding support.
Advanced visual question answering and detailed image analysis
```bash
# Install dependencies
pip install torch torchvision transformers
pip install git+https://github.com/zai-org/glm-4.5v
```
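Once the dependencies are installed, the model can be loaded through the standard Hugging Face `transformers` interface. The snippet below is a minimal sketch, not a confirmed recipe: the repository ID `zai-org/GLM-4.5V`, the `AutoModelForImageTextToText` loading path, the chat-template message format, and the example image path are all assumptions; check the model card for the exact identifiers and prompt format.

```python
# Minimal usage sketch. Assumes the checkpoint is published on the Hugging Face Hub
# as "zai-org/GLM-4.5V" and follows the generic image-text-to-text interface;
# adjust the model ID and message format to match the official model card.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "zai-org/GLM-4.5V"  # assumed repository name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # adjust to your hardware
    device_map="auto",
    trust_remote_code=True,
)

# Hypothetical local image file used for illustration.
image = Image.open("example.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Build model inputs from the chat template and generate a response.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
response = processor.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(response)
```

`device_map="auto"` lets `accelerate` place the weights across available GPUs; reduce `max_new_tokens` or switch `torch_dtype` if memory is tight.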
2024-12-19: Initial release with comprehensive vision-language capabilities
If you use GLM-4.5V in your research, please cite:
```bibtex
@misc{glm45v2024,
  title        = {GLM-4.5V: A Vision-Language Model with 1M Context and Grounding},
  author       = {zai-org},
  year         = {2024},
  howpublished = {\url{https://github.com/zai-org/glm-4.5v}}
}
```