Open Source

zai

GLM 4.5

GLM 4.5 represents a major leap in our reasoning research, packing 355 billion carefully tuned parameters into a highly efficient mixture‑of‑experts architecture. It excels at multi‑step logic, mathematical proofs, and complex code generation while keeping latency within interactive bounds. An enlarged 256 k‑token context window enables full document analysis, legal discovery, and multi‑hour meeting transcription without external chunking. Specialist sub‑networks handle science, finance, law, and creative writing, dynamically routed to maximize accuracy where it matters. Advanced alignment techniques minimize hallucinations and enforce policy compliance, making GLM 4.5 a reliable partner for demanding professional workflows.

zai

GLM 4.5 Air

AIR is a streamlined descendant of the GLM lineage designed for scenarios where every millisecond and cent count. By pruning redundant pathways and adopting low‑rank adaptation kernels, AIR delivers much of the expressive power of larger siblings while running comfortably on a single high‑end GPU or modest CPU cluster. Its latency is measured in tens of milliseconds, enabling responsive mobile chat and high‑frequency retrieval tasks. Compression‑aware training ensures knowledge retention despite the reduced footprint, keeping answers factual and coherent. With flexible quantization presets, AIR allows developers to trade accuracy for speed on the fly, optimizing cost at scale.

zai

GLM 4.5 X

GLM 4.5 X pairs the massive knowledge base of GLM 4.5 with an ultra‑fast inference stack architected for throughput‑first applications. By exploiting expert routing, flash attention, and speculative decoding, it streams answers at up to twenty thousand tokens per second on contemporary A100 clusters, rivaling much smaller models in latency. Extensive tool‑use training lets it chain API calls, run code, and retrieve documents while keeping conversation fluid. Powerful reasoning ensures accurate analysis of financial models, legal contracts, and scientific papers in real time. Organizations with demanding SLAs choose GLM 4.5 X when both depth and speed are non‑negotiable.

OpenAI

ChatGPT-4o Latest

ChatGPT-4o contains latest improvements for chat use cases, expected for testing/evaluation purpose. ChatGPT-4o also supports structured outputs, with up to 16k max output tokens GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.

Vision

zai

GLM 4.5 AirX

GLM 4.5 AirX represents the sweet spot between intelligence and efficiency. It inherits the 4.5 knowledge corpus yet fits inside an eight‑billion active parameter envelope through selective activation and gated experts. With advanced quantization and compressive adapters, it runs at near X speed while staying within Air budget constraints. Benchmarks show eighty‑five percent of flagship accuracy on reasoning and coding while costing less than half to operate. It is ideal for conversational search, code review, and medium‑length content generation where quality matters but every inference cycle is billed. Smart scheduling APIs further optimize batch utilization across mixed hardware.

zai

Qwen3-235B-A22B is the flagship Mixture-of-Experts (MoE) model in the new Qwen 3 family. With 235 billion total parameters—but only 22 billion active per token—it delivers near-frontier accuracy while remaining deployable on multi-GPU clusters. The model uses 128 experts (8 are routed per token) across 94 transformer layers and employs Gated Q-Attention (64 Q-heads, 4 KV-heads) for efficient scaling. It natively handles 32 768-token contexts and has been validated up to 131 072 tokens using YaRN positional scaling. Like all Qwen 3 models, it ships under Apache-2.0, supports explicit thinking / no-thinking modes, and shows state-of-the-art reasoning, code generation, and multilingual performance among open-source LLMs.

Qwen

Qwen3-30B-A3B

Qwen3-30B-A3B is the “Pro” MoE tier designed for balanced cost-to-quality. It contains 30.5 billion total parameters, but activates only ~3.3 billion at inference—yielding GPT-3.5-class quality at a fraction of the memory footprint. The architecture mirrors the 235 B model (128 experts, 8 routed) but runs 48 transformer layers with 32/4 GQA heads. It retains the 32 k native context window, YaRN compatibility, and the same controllable thinking switch that lets developers trade raw reasoning traces for latency.

Qwen

Qwen3-32B

Qwen3-32B is a dense 32.8 billion-parameter model, positioned as the high-accuracy single-expert counterpart to the MoE line. It uses 64 transformer layers with 64/8 GQA heads and the full 32 k context window (extendable via YaRN). Because every parameter is active, it excels at deterministic generation, agentic tool-calling, and creative writing where dense representations can outperform similarly sized MoE peers. It is drop-in compatible with Hugging Face Transformers ≥ 4.51, vLLM, SGLang, and common GGUF/MLX-LM ports.

Qwen

OpenAI o3-mini is our first small reasoning model that supports highly requested developer features including function calling⁠(opens in a new window), Structured Outputs⁠(opens in a new window), and developer messages⁠(opens in a new window), making it production-ready out of the gate. Like OpenAI o1-mini and OpenAI o1-preview, o3-mini will support streaming⁠(opens in a new window). Also, developers can choose between three reasoning effort⁠(opens in a new window) options—low, medium, and high—to optimize for their specific use cases. This flexibility allows o3-mini to “think harder” when tackling complex challenges or prioritize speed when latency is a concern. o3-mini does not support vision capabilities, so developers should continue using OpenAI o1 for visual reasoning tasks.

OpenAI

The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.

Open Source

Qwen

Qwen2 VL 72B

### What's New in Qwen2-VL? #### [](https://huggingface.co/Qwen/Qwen2-VL-72B#key-enhancements)Key Enhancements: - SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. - Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc. - Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. - Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2: - Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. - Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g, tables), and generating structured outputs especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots. - Long-context Support up to 128K tokens and can generate up to 8K tokens. - Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more. This repo contains the instruction-tuned 7B Qwen2.5 model, which has the following features: - Type: Causal Language Models - Training Stage: Pretraining & Post-training - Architecture: transformers with RoPE, SwiGLU, RMSNorm, and Attention QKV bias - Number of Parameters: 7.61B - Number of Paramaters (Non-Embedding): 6.53B - Number of Layers: 28 - Number of Attention Heads (GQA): 28 for Q and 4 for KV - Context Length: Full 131,072 tokens and generation 8192 tokens - Please refer to [this section](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct#processing-long-texts) for detailed instructions on how to deploy Qwen2.5 for handling long texts. For more details, please refer to our [blog](https://qwenlm.github.io/blog/qwen2.5/), [GitHub](https://github.com/QwenLM/Qwen2.5), and [Documentation](https://qwen.readthedocs.io/en/latest/).

Open Source

Qwen

Qwen2.5 Turbo (1M Context)

Following the release of Qwen2.5, the team responded to the community's demand for handling longer contexts. Over the past few months, significant optimizations have been made to enhance the model's capabilities and inference performance for extremely long contexts. Now, the team is proud to introduce the new **Qwen2.5-Turbo** model, featuring the following advancements: - **Extended Context Support**: The context length has been increased from 128k to 1M tokens, equivalent to approximately 1 million English words or 1.5 million Chinese characters. This capacity corresponds to 10 full-length novels, 150 hours of speech transcripts, or 30,000 lines of code. Qwen2.5-Turbo achieves 100% accuracy in the 1M-token Passkey Retrieval task and scores 93.1 on the RULER long-text evaluation benchmark, outperforming GPT-4 (91.6) and GLM4-9B-1M (89.9). Moreover, the model retains strong performance in short sequence tasks, comparable to GPT-4o-mini. - **Faster Inference Speed**: Leveraging sparse attention mechanisms, the time to generate the first token for a 1M-token context has been reduced from 4.9 minutes to just 68 seconds, representing a 4.3x speed improvement. - **Cost Efficiency**: The pricing remains unchanged at $0.05 per 1M tokens. At this rate, Qwen2.5-Turbo processes 3.6 times more tokens than GPT-4o-mini for the same cost.

xai

Grok Beta

Grok-2 Beta introduces two advanced language models, with superior performance in reasoning, coding, and understanding tasks compared to prior models. Grok-2 outperforms competitors like GPT-4 Turbo on key benchmarks and includes state-of-the-art capabilities in multimodal and real-time applications. It's accessible via the platform for Premium users and will soon be available through a secure, low-latency enterprise API. These updates mark significant advancements in xAI's pursuit of cutting-edge AI development.

xai

Grok Beta

Grok-2 Beta introduces two advanced language models, with superior performance in reasoning, coding, and understanding tasks compared to prior models. Grok-2 outperforms competitors like GPT-4 Turbo on key benchmarks and includes state-of-the-art capabilities in multimodal and real-time applications. It's accessible via the platform for Premium users and will soon be available through a secure, low-latency enterprise API. These updates mark significant advancements in xAI's pursuit of cutting-edge AI development.

Vision

OpenAI

GPT 4o Mini

GPT 4o Mini ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.

Vision

OpenAI

O1 Preview

The OpenAI o1 Preview models are designed to spend more time thinking before responding, improving their ability to reason through complex tasks in science, coding, and math. The first model of this series is now available in ChatGPT and the API, with regular updates expected.

OpenAI

O1 Mini

The OpenAI o1-mini is a newly released smaller version of the o1 model, designed to optimize reasoning tasks, particularly in coding. It provides advanced reasoning capabilities similar to its larger counterpart, making it well-suited for generating and debugging complex code. However, it is 80% cheaper and faster, making it a cost-effective solution for developers who need reasoning power but don’t require broad world knowledge.

Perplexity

Sonar

Lightweight offering with search grounding, quicker and cheaper than Sonar Pro

Google

Gemini 2.0 Pro 0205(Experiment)

Gemini 2.0 Pro is Google's most advanced AI model to date, designed to excel in complex tasks such as coding and handling intricate prompts. It features a substantial context window of 2 million tokens, enabling comprehensive analysis of extensive information. The model also integrates seamlessly with tools like Google Search and code execution environments, enhancing its utility for developers. Currently available in experimental form through Google AI Studio and Vertex AI, as well as to Gemini Advanced users, Gemini 2.0 Pro represents a significant leap forward in AI capabilities. citeturn0search1

Vision

Anthropic

Claude 3.5 Haiku 20241022

Claude 3.5 Haiku is the next generation of our fastest model. For the same cost and similar speed to Claude 3 Haiku, Claude 3.5 Haiku improves across every skill set and surpasses even Claude 3 Opus, the largest model in our previous generation, on many intelligence benchmarks. Claude 3.5 Haiku is particularly strong on coding tasks. For example, it scores 40.6% on SWE-bench Verified, outperforming many agents using publicly available state-of-the-art models—including the original Claude 3.5 Sonnet and GPT-4o.

Vision

Anthropic

Llama 3.2 is the latest iteration of Meta's open-source AI model family, offering enhanced capabilities and versatility. The new release includes models of various sizes: 1B, 3B, 11B, and 90B parameters. The 1B and 3B models are lightweight, multilingual, and text-only, designed for efficient deployment on mobile and edge devices. The larger 11B and 90B models are multimodal, capable of processing both text and high-resolution images. Key features of Llama 3.2 include: 1. Improved performance across over 150 benchmark datasets in multiple languages. 2. Multimodal capabilities in larger models for image understanding and visual reasoning. 3. Integration with Llama Stack, providing a streamlined developer experience with support for multiple programming languages and deployment options. 4. Enhanced support for agentic components, including tool calling, safety guardrails, and retrieval augmented generation. 5. Compatibility with various hardware platforms, including ARM, MediaTek, and Qualcomm for mobile and edge devices. Llama 3.2 has garnered significant attention, with over 350 million downloads on Hugging Face alone. It's being utilized across various industries for applications such as data privacy, productivity enhancement, contextual understanding, and solving complex business needs. The ecosystem around Llama continues to grow, with partners like Dell, Zoom, DoorDash, and KPMG leveraging the technology for diverse use cases.

Open Source

OpenAI

GPT-4o

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.

Vision

Qwen

Qwen2.5 72B Instruct

Qwen2.5 is the latest series of Qwen large language models. For Qwen2.5, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters. Qwen2.5 brings the following improvements upon Qwen2: * Significantly more knowledge and has greatly improved capabilities in coding and mathematics, thanks to our specialized expert models in these domains. * Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g, tables), and generating structured outputs especially JSON. More resilient to the diversity of system prompts, enhancing role-play implementation and condition-setting for chatbots. * Long-context Support up to 128K tokens and can generate up to 8K tokens. * Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

Qwen

Qwen2 VL 72B Instruct

Qwen2-VL is the latest iteration of multimodal large language models developed by the Qwen team at Alibaba Cloud. This advanced AI system represents a significant leap forward in the field of vision-language models, building upon its predecessor, Qwen-VL. Qwen2-VL boasts state-of-the-art capabilities in understanding images of various resolutions and aspect ratios, as well as the ability to comprehend videos exceeding 20 minutes in length. One of the most notable features of Qwen2-VL is its versatility as an agent capable of operating mobile devices, robots, and other systems based on visual input and text instructions. This makes it a powerful tool for a wide range of applications, from personal assistance to industrial automation. The model also offers robust multilingual support, enabling it to understand and process text in various languages within images, catering to a global user base.

Vision

Qwen

Qwen2 VL 7B Instruct

Qwen2-VL is the latest iteration of multimodal large language models developed by the Qwen team at Alibaba Cloud. This advanced AI system represents a significant leap forward in the field of vision-language models, building upon its predecessor, Qwen-VL. Qwen2-VL boasts state-of-the-art capabilities in understanding images of various resolutions and aspect ratios, as well as the ability to comprehend videos exceeding 20 minutes in length. One of the most notable features of Qwen2-VL is its versatility as an agent capable of operating mobile devices, robots, and other systems based on visual input and text instructions. This makes it a powerful tool for a wide range of applications, from personal assistance to industrial automation. The model also offers robust multilingual support, enabling it to understand and process text in various languages within images, catering to a global user base.

Vision

Mistral

Pixtral 12B(2409)

Pixtral 12B is a state-of-the-art multimodal AI model developed by Mistral AI. It combines strong visual understanding capabilities with excellent text processing, making it a versatile tool for various multimodal tasks. Key features include: * Natively multimodal architecture, trained on interleaved image and text data * 400M parameter vision encoder and 12B parameter multimodal decoder based on Mistral Nemo Support for variable image sizes and multiple images within a 128k token context window * Top-tier performance on multimodal benchmarks like MMMU (52.5%), outperforming many larger models * Maintained excellence in text-only tasks, unlike some other multimodal models Pixtral excels in tasks such as chart understanding, document question-answering, and multimodal reasoning. It's particularly strong in instruction following for both multimodal and text-only scenarios. The model can process images at their native resolution and aspect ratio, offering flexibility in token usage for image processing.

Vision

Google

Gemini Flash 1.5 0827 (experiment)

Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots. Gemini 1.5 Flash is designed for high-volume, high-frequency tasks where cost and latency matter. On most common tasks, Flash achieves comparable quality to other Gemini Pro models at a significantly reduced cost. Flash is well-suited for applications like chat assistants and on-demand content generation where speed and scale matter.

Vision

Google

Claude 3 Sonnet is an ideal balance of intelligence and speed for enterprise workloads. Maximum utility at a lower price, dependable, balanced for scaled deployments. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-family) #multimodal

Vision

Google

Gemini Flash 1.5 (preview)

Gemini 1.5 Flash is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots. Gemini 1.5 Flash is designed for high-volume, high-frequency tasks where cost and latency matter. On most common tasks, Flash achieves comparable quality to other Gemini Pro models at a significantly reduced cost. Flash is well-suited for applications like chat assistants and on-demand content generation where speed and scale matter. #multimodal

Vision

Google

GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.

Vision

OpenAI

GPT-3.5 Turbo

GPT-3.5 Turbo is OpenAI's fastest model. It can understand and generate natural language or code, and is optimized for chat and traditional completion tasks. Updated by OpenAI to point to the [latest version of GPT-3.5](/models?q=openai/gpt-3.5). Training data up to Sep 2021.

OpenAI

GPT-4o 64k(alpha test version)

An experimental version of GPT-4o with a maximum of 64K output tokens per request. GPT-4o ("o" for "omni") is OpenAI's latest AI model, supporting both text and image inputs with text outputs. It maintains the intelligence level of [GPT-4 Turbo](/models/openai/gpt-4-turbo) while being twice as fast and 50% more cost-effective. GPT-4o also offers improved performance in processing non-English languages and enhanced visual capabilities.

OpenAI's flagship model, GPT-4 is a large-scale multimodal language model capable of solving difficult problems with greater accuracy than previous models due to its broader general knowledge and advanced reasoning capabilities. Training data: up to Sep 2021.

Anthropic

Claude 3 Haiku (20240307)

Claude 3 Haiku is Anthropic's fastest and most compact model for near-instant responsiveness. Quick and accurate targeted performance. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-haiku) #multimodal

Vision

Anthropic

Claude 3 Opus(20240229)

Claude 3 Opus is Anthropic's most powerful model for highly complex tasks. It boasts top-level performance, intelligence, fluency, and understanding. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-family) #multimodal

Vision

Anthropic

Claude 3 Sonnet(20240229)

Claude 3 Sonnet is an ideal balance of intelligence and speed for enterprise workloads. Maximum utility at a lower price, dependable, balanced for scaled deployments. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/claude-3-family) #multimodal

Vision

OpenAI

Meta: CodeLlama 70B Instruct

Code Llama is a family of large language models for code. This one is based on [Llama 2 70B](/models/meta-llama/llama-2-70b-chat) and provides zero-shot instruction-following ability for programming tasks.

Cohere

Cohere: Command

Command is an instruction-following conversational model that performs language tasks with high quality, more reliably and with a longer context than our base generative models. Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).

Open Source

Cohere

Cohere: Command R

Command-R is a 35B parameter model that performs conversational language tasks at a higher quality, more reliably, and with a longer context than previous models. It can be used for complex workflows like code generation, retrieval augmented generation (RAG), tool use, and agents. Read the launch post [here](https://txt.cohere.com/command-r/). Use of this model is subject to Cohere's [Acceptable Use Policy](https://docs.cohere.com/docs/c4ai-acceptable-use-policy).

Open Source

Cohere

ShieldGemma is a series of safety content moderation models built upon Gemma 2 that target four harm categories (sexually explicit, dangerous content, hate, and harassment). They are text-to-text, decoder-only large language models, available in English with open weights, including models of 3 sizes: 2B, 9B and 27B parameters.

Google's latest multimodal model, supporting image and video in text or chat prompts. Optimized for language tasks including: - Code generation - Text generation - Text editing - Problem solving - Recommendations - Information extraction - Data extraction or generation - AI agents Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms). *Note: Preview models are offered for testing purposes and should not be used in production apps. This model is **heavily rate limited**.*

Vision