NVIDIA Offers Over 100 AI Models for Free via API: DeepSeek V4, MiniMax, GLM, and Gemma in One Place

NVIDIA quietly launched a catalog with more than a hundred AI models that you can access completely for free via API. DeepSeek V4, MiniMax M2.7, GLM 5.1, Gemma 4, Qwen 3.5 and dozens of others — all in one place, without a credit card and with an OpenAI-compatible interface. For Czech developers, this means the ability to test cutting-edge models without investing in infrastructure.

What is NVIDIA NIM and why it is not just marketing

The build.nvidia.com platform has been operating since 2024 as a hosted catalog of NVIDIA inference microservices (NIM). While competitors' free tiers are typically limited to one or two models, NVIDIA has grouped more than 100 models from dozens of providers under one roof, and most of them can be called for free.

Among the available models you will find:

  • DeepSeek V4-Pro and V4-Flash — 1.6 trillion and 284 billion parameters respectively, million-token context, MIT license
  • MiniMax M2.7 — 230-billion-parameter MoE model for coding, reasoning and office tasks
  • GLM 5.1 and GLM 4.7 — flagship models of Chinese Zhipu AI for agent workflows
  • Google Gemma 4 31B — compact dense model with frontier-level performance
  • Qwen 3.5 122B — multimodal MoE from Alibaba
  • Mistral Small 4 — 119-billion-parameter hybrid MoE with 256K context and multimodal input
  • NVIDIA Nemotron — NVIDIA's own model family, including ASR, TTS, OCR and safety filters
  • OpenAI GPT-OSS 20B/120B — open models from OpenAI

The catalog covers not only text generation but also speech (Nemotron ASR, Studio Voice), images (FLUX.2), video, embeddings, retrieval, safety guardrails and biological simulations. For developers, it is essentially a universal testing laboratory.

How the free API works in practice

Registration is simple: log in to the NVIDIA Developer Program, generate an API key starting with nvapi- and call the endpoint https://integrate.api.nvidia.com/v1. The interface is fully compatible with the OpenAI Chat Completions API — just change base_url and api_key.
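As a minimal sketch of the call format, the stdlib-only Python below builds an OpenAI-style Chat Completions request against the endpoint above. The model name, prompt and max_tokens value are illustrative, and the key is the nvapi- key you generate in the Developer Program:

```python
import json
import urllib.request

NIM_BASE = "https://integrate.api.nvidia.com/v1"

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build an OpenAI-compatible Chat Completions request for the NIM endpoint."""
    payload = {
        "model": model,  # format: <provider>/<model>, e.g. "deepseek-ai/deepseek-v4-pro"
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return urllib.request.Request(
        f"{NIM_BASE}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # key starts with nvapi-
            "Content-Type": "application/json",
        },
        method="POST",
    )

if __name__ == "__main__":
    # Real network call; requires a valid key from the NVIDIA Developer Program.
    req = build_chat_request("deepseek-ai/deepseek-v4-pro", "Hello!", "nvapi-...")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

In practice you would more likely point the official OpenAI SDK (or LangChain) at the same base URL; the raw request is shown only to make the wire format explicit.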

Models are called in the format <provider>/<model>, for example:

  • deepseek-ai/deepseek-v4-pro
  • minimaxai/minimax-m2.7
  • z-ai/glm-5.1
  • google/gemma-4-31b-it

This works not only with the official OpenAI SDK but also with LangChain, LlamaIndex and other frameworks. The developer does not need to learn a new API — only one configuration line changes.

What "free" means — limits and reality

Free does not mean without limits. NVIDIA has set fair-use limits that are sufficient for prototyping but not for production operation:

  • Rate limit: approximately 40 requests per minute per model
  • Credits: 1,000 inference credits upon registration, up to 5,000 upon request
  • Computational demands: large models such as GLM 5.1 or Kimi K2.5 consume credits faster than lightweight variants

Forty requests per minute is enough for development, testing agents, prompt-engineering experiments or model comparison. For running a public chatbot or a team-facing agent tool it is not enough, and at peak times you will also start seeing 429 (Too Many Requests) errors.
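A simple client-side mitigation is exponential backoff on those 429 responses. The sketch below is library-agnostic: it assumes, purely for illustration, that your HTTP client surfaces throttling as a RuntimeError whose message contains "429"; adapt the except clause to the actual exception your client raises:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.5):
    """Retry a zero-argument callable when the free-tier rate limit is hit.

    Assumes `fn` raises RuntimeError containing "429" when throttled;
    any other error is re-raised immediately.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError as err:
            if "429" not in str(err) or attempt == max_retries - 1:
                raise
            # Wait roughly base_delay * 2^attempt, with jitter so parallel
            # clients do not all retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```

Backoff does not raise the 40 req/min ceiling; it only smooths bursts so a prototype degrades gracefully instead of failing outright.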

Notably, some models are also available for download (for example GLM 5.1, Gemma 4, Qwen 3.5), so after exhausting your credits you can switch to local inference.

Integration with OpenCode, OpenClaw and other tools

It is precisely the OpenAI compatibility that makes NIM an attractive backend for a range of developer tools. OpenCode — an open-source coding agent — allows you to add NVIDIA NIM as a provider in the configuration file with a single block. Similarly, OpenClaw works via a proxy such as LiteLLM, which natively supports NIM endpoints.

It also works in popular IDEs: Cursor allows you to enter a custom OpenAI-compatible URL in the model settings, Zed has configurable providers for the assistant. In practice, this means that you can have autocompletion, chat and agent execution driven by freely available models on NVIDIA infrastructure.

Keep in mind, however, that the 40 req/min rate limit is exhausted very quickly by IDE autocompletion traffic. A more realistic split is autocompletion from another source, with agent and chat tasks on NIM.
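That split can be expressed as a tiny routing helper. The local URL below is a hypothetical placeholder for whatever OpenAI-compatible server you run yourself; only the NIM base URL comes from NVIDIA's documentation:

```python
NIM_BASE = "https://integrate.api.nvidia.com/v1"
LOCAL_BASE = "http://localhost:8000/v1"  # assumed local OpenAI-compatible server

def pick_base_url(task: str) -> str:
    """Route high-frequency autocompletion locally; send chat and agent
    tasks to NIM, where the 40 req/min budget actually matters."""
    return LOCAL_BASE if task == "autocomplete" else NIM_BASE
```

Because both endpoints speak the same Chat Completions format, the rest of the client code stays identical regardless of which base URL is chosen.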

Why NVIDIA is doing this — and what you get out of it

The free tier is not charity. NVIDIA is building a sales funnel: the developer prototypes on the free API, tests in the sandbox on physical GPUs (H200, B300), and eventually switches to the paid NVIDIA AI Enterprise variant or self-hosted NIM containers. Migration is seamless because the code remains the same — only the endpoint and key change.

But that does not mean the free version has no value. For Czech developers, students, startups and small teams, it is one of the easiest ways to:

  • compare the performance of different models on the same tasks,
  • test agent workflows without monthly API payments,
  • experiment with multimodal, voice and specialized models,
  • use models that are not commonly available in Western APIs (MiniMax, GLM, Qwen).

A large part of these models also comes from Chinese laboratories, which often do not offer European data residency. NVIDIA hosts inference on its own infrastructure, which for European users means a more predictable legal framework than directly calling Chinese APIs.

Comparison with alternatives

NVIDIA NIM is not the only free API gateway. OpenRouter aggregates models from hundreds of providers, but free models often change and the quality of inference nodes is uneven. Amazon Bedrock Mantle offers OpenAI-compatible API within AWS, but requires an AWS account and credits. Sakura AI Engine in Japan has 3,000 free requests per month, but is geographically limited.

The advantage of NIM is in scale and stability — one key, one format, over a hundred models, support directly from NVIDIA. For quick experiments and development, it is the most efficient entry point on the market.

What to watch out for

First: safety filters run on NVIDIA's side. This means that model behavior may differ slightly from local deployment of open weights. Second: model names change — MiniMax M2.5 was replaced by M2.7, Kimi undergoes rapid iterations. We recommend parameterizing the model name via environment variables. Third: the free tier is truly only for development. Production deployment requires switching to a paid service.
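A minimal sketch of that third recommendation, assuming a hypothetical NIM_MODEL environment variable (the variable name and default are illustrative):

```python
import os

def resolve_model(default: str = "minimaxai/minimax-m2.7") -> str:
    """Read the model name from the environment, so a rename such as
    M2.5 -> M2.7 is a configuration change rather than a code change."""
    return os.environ.get("NIM_MODEL", default)
```

Deployments then override the model with `NIM_MODEL=z-ai/glm-5.1` (or similar) without touching the source.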

Is NVIDIA NIM free API available from the Czech Republic?

Yes, registration on build.nvidia.com is global and does not require a US or European entity. A regular email and login to the NVIDIA Developer Program are sufficient. Inference nodes run on NVIDIA infrastructure, not directly in China, which simplifies compliance for European users.

What is the difference between a "Free Endpoint" and a "Downloadable" model?

Free Endpoint means the model runs on NVIDIA servers and you call it via API. Downloadable means you can download the weights and run them locally — suitable for teams that need full control over data or want to fine-tune. Some models, for example GLM 5.1, are available both ways.

Can I use the free API in a production application?

NVIDIA officially does not recommend this. The rate limit of 40 requests per minute and the credit cap mean that under higher load the service will be throttled or interrupted. The free tier is intended for development, testing and prototyping. For production operation, NVIDIA offers the paid AI Enterprise variant or self-hosted NIM containers.
