Qwen2.5 Turbo (1M Context)

After the release of Qwen2.5, the community asked for support for longer contexts. Over the past few months, the team has made significant optimizations to the model's capabilities and inference performance on extremely long contexts. The result is the new Qwen2.5-Turbo model, which features the following advancements:

Extended Context Support: The context length has been increased from 128k to 1M tokens, roughly equivalent to 1 million English words or 1.5 million Chinese characters. This capacity corresponds to 10 full-length novels, 150 hours of speech transcripts, or 30,000 lines of code. Qwen2.5-Turbo achieves 100% accuracy on the 1M-token Passkey Retrieval task and scores 93.1 on the RULER long-text evaluation benchmark, outperforming GPT-4 (91.6) and GLM4-9B-1M (89.9). The model also retains strong performance on short-sequence tasks, comparable to GPT-4o-mini.
Faster Inference Speed: By using sparse attention mechanisms, the team reduced the time to first token for a 1M-token context from 4.9 minutes to 68 seconds, a 4.3x speedup.
Cost Efficiency: The price remains unchanged at $0.05 per 1M tokens. At this rate, Qwen2.5-Turbo processes 3.6 times as many tokens as GPT-4o-mini for the same cost.
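
The speed and cost figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names `speedup` and `cost_usd` are hypothetical, and the inputs (4.9 minutes, 68 seconds, $0.05 per 1M tokens) are taken directly from the list above rather than measured.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of the old latency to the new latency (hypothetical helper)."""
    return before_seconds / after_seconds


def cost_usd(tokens: int, price_per_million: float = 0.05) -> float:
    """Cost of processing `tokens` tokens at a flat per-million-token price.

    The default price is the $0.05 per 1M tokens quoted in the text.
    """
    return tokens / 1_000_000 * price_per_million


if __name__ == "__main__":
    ttft_before = 4.9 * 60  # 4.9 minutes expressed in seconds
    ttft_after = 68.0       # seconds

    # Time-to-first-token speedup for a 1M-token context
    print(f"speedup: {speedup(ttft_before, ttft_after):.1f}x")

    # Cost of one full 1M-token context at the quoted price
    print(f"cost of 1M tokens: ${cost_usd(1_000_000):.2f}")
```

Under these inputs the ratio rounds to the 4.3x quoted above, and a single full-length 1M-token context costs $0.05.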