
On the CVTG-2K benchmark, which measures accuracy in placing text across multiple image locations, GLM-Image achieved a Word Accuracy score of 0.9116, ranking first among open-source models. The model also led the LongText-Bench test for rendering extended text passages, scoring 0.952 for English and 0.979 for Chinese across eight scenarios including signs, posters, and dialog boxes.
The model natively supports multiple resolutions from 1024×1024 to 2048×2048 pixels without requiring retraining, the report added.
Hardware optimization strategy
Training GLM-Image on Ascend hardware required Zhipu to develop custom optimization techniques for Huawei’s chip architecture. The company built a training suite that implements dynamic graph multi-level pipelined deployment, which lets different stages of the training process run concurrently and reduces bottlenecks.
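The general idea of pipelining stages so that they overlap can be illustrated with a small sketch. This is not Zhipu's training suite or any Ascend API. The stage functions (`load_batch`, `forward`, `update`) and the helper `run_pipeline` are hypothetical names invented for illustration, and plain Python threads stand in for whatever scheduling the real system uses.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the input stream


def run_stage(fn, inbox, outbox):
    """Pull items from inbox, apply fn, and pass results on until SENTINEL arrives."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(stages, items):
    """Run each stage in its own thread, connected by FIFO queues.

    While stage k handles item i, stage k+1 can already handle item i-1,
    so no stage has to wait for the whole batch to finish upstream.
    """
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for training stages (assumptions, not real training code).
def load_batch(i):
    return {"batch": i}


def forward(b):
    return {**b, "loss": b["batch"] * 0.5}


def update(b):
    return (b["batch"], b["loss"])


pipeline_results = run_pipeline([load_batch, forward, update], range(4))
print(pipeline_results)
```

Because each stage is a single thread reading from a FIFO queue, results come out in input order. A production system would overlap work across devices rather than Python threads, but the queue-per-stage structure is the same basic pattern.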

