zai-org/GLM-4.5V

A Vision-Language Model with 1M Context and Grounding

GLM-4.5V: The Multimodal Powerhouse

GLM-4.5V is zai-org's latest vision-language model, featuring a 1M-token context window, strong multimodal understanding and reasoning, and visual grounding.

Multimodal Capabilities

Image Reasoning

Advanced visual question answering and detailed image analysis
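
The snippet below is a minimal sketch of what an image-reasoning (visual question answering) call could look like through Hugging Face transformers. The AutoProcessor/AutoModelForImageTextToText classes, the chat-message layout, and the example image URL are assumptions rather than an interface confirmed by this README; consult the model card for the exact usage.

# Hypothetical image-reasoning (VQA) sketch; class names and message format are assumptions.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "zai-org/GLM-4.5V"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",   # select bf16/fp16 automatically where supported
    device_map="auto",    # spread weights across available GPUs
)

# One user turn containing an image plus a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/sample.jpg"},  # placeholder image URL
            {"type": "text", "text": "What is happening in this image?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the prompt.
answer = processor.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)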

Implementation & Usage

Quick Start

Environment Setup

# Install dependencies
pip install torch torchvision transformers
pip install git+https://github.com/zai-org/glm-4.5v
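
As a quick sanity check after installation, the following sketch prints the installed library versions and confirms GPU availability. It assumes a PyTorch build with CUDA support; nothing in it is specific to GLM-4.5V.

# Environment sanity check: library versions and GPU availability.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))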

Project Updates

Latest Release: v1.0.0

2024-12-19

Initial release with comprehensive vision-language capabilities

Citation

If you use GLM-4.5V in your research, please cite:

@misc{glm45v2024,
title={GLM-4.5V: A Vision-Language Model with 1M Context