What the new version of File Search brings
Retrieval-augmented generation, or RAG for short, is a technique where an AI model doesn't just answer from what it "knows" from training, but actively searches for relevant information in an external knowledge base. Until now, however, most RAG systems were limited to plain text. That is changing.
Gemini API File Search now processes text and images together. Under the hood is the Gemini Embedding 2 model, which maps different types of data (text, images, video, audio, and PDF documents) into a single semantic space. This means you can upload a PDF full of charts and photos into your knowledge base and ask about them in natural language. The model understands what is in an image without any preprocessing or manual tagging on your part.
The new feature was officially announced on the Google AI Studio profile on X and comes less than a week after Gemini Embedding 2 reached general availability (GA).
Multimodal search: find images by mood
Until now, if you wanted to find a specific image in an archive, you had to rely on keywords or file names. With multimodal File Search, you describe what you're looking for in natural language — and the model finds images that visually match the description.
Imagine a creative agency that needs to pull photos with a certain emotional atmosphere or visual style from a twenty-year archive. Instead of browsing folders, just type: "find photos with warm autumn light and a melancholic mood" — and File Search returns relevant results. No complex tagging, no metadata.
Gemini Embedding 2, which powers all of this, can process up to 8,192 text tokens, 6 images, 120 seconds of video, 180 seconds of audio, and 6 PDF pages in a single call. And crucially — it supports more than 100 languages, so you can enter queries in Czech as well.
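The "find by mood" flow above can be sketched in a few lines. This is a toy illustration, not the File Search API itself: the embedding vectors are hard-coded placeholders standing in for what the model would return, and the ranking is plain cosine similarity over them.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 4-dimensional "embeddings"; real vectors from the model have
# thousands of dimensions and encode text and images in one space.
archive = {
    "autumn_park.jpg":    [0.9, 0.1, 0.7, 0.0],
    "studio_product.jpg": [0.1, 0.9, 0.0, 0.2],
    "foggy_street.jpg":   [0.6, 0.4, 0.3, 0.5],
}

# Placeholder embedding of the query
# "warm autumn light and a melancholic mood".
query = [0.85, 0.15, 0.65, 0.05]

ranked = sorted(archive, key=lambda name: cosine(query, archive[name]), reverse=True)
print(ranked[0])  # the image whose embedding sits closest to the query
```

In the real service the archive photos are indexed once and the query is embedded at search time; the ranking principle, nearest vectors win, is the same.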
Custom metadata: an end to needle-in-a-haystack searches
Uploading files to a database is easy. Finding the right one among thousands is another matter. Google is therefore adding custom metadata to File Search: key-value tags that can be attached to unstructured data.
In practice, this means you can mark documents with tags like department: Legal, status: Final, or year: 2025 and, when querying, search only a precisely defined subset of the data. This significantly reduces noise, speeds up response time, and increases answer accuracy, which matters most in production RAG applications processing millions of documents.
For a Czech company or startup experimenting with RAG systems, this means easier scaling without the need to build your own filtering logic.
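The idea of pre-filtering by tags can be shown with a small sketch. The tag names mirror the article's examples (department, status, year); the filter function is an illustration of the principle, not the API's own filter syntax.

```python
# Each document carries custom key-value metadata, as described above.
docs = [
    {"id": "contract_007.pdf", "meta": {"department": "Legal", "status": "Final", "year": 2025}},
    {"id": "draft_nda.pdf",    "meta": {"department": "Legal", "status": "Draft", "year": 2025}},
    {"id": "q3_report.pdf",    "meta": {"department": "Finance", "status": "Final", "year": 2024}},
]

def filter_by_metadata(docs, **required):
    # Keep only documents whose metadata matches every required tag;
    # semantic search then runs on this smaller subset, reducing noise.
    return [d for d in docs if all(d["meta"].get(k) == v for k, v in required.items())]

subset = filter_by_metadata(docs, department="Legal", status="Final")
print([d["id"] for d in subset])  # → ['contract_007.pdf']
```

Narrowing the candidate set before the vector search runs is also why response times improve: the expensive similarity step touches far fewer documents.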
Page citations: every answer has its origin
One of the biggest pain points of RAG systems was that users couldn't easily verify where the AI got specific information from. File Search now solves this with page citations — the model automatically attaches a link to the original page in a PDF or other indexed document to each answer.
This level of granularity is crucial for trustworthiness. If the AI answers a legal question or searches for a technical specification, the user can open the source document at exactly the right place with one click. For industries such as law, healthcare, or banking — where factual accuracy and traceability are key — this means a shift from an "assistant" to a tool that can be trusted.
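On the application side, page citations boil down to walking the response and collecting (document, page) pairs to render as links. The response shape below is a mock invented for illustration; the actual grounding-metadata structure is defined in the File Search documentation.

```python
def extract_citations(response):
    # Collect (source document, page) pairs so the answer can link
    # back to the exact page it was grounded on.
    # NOTE: this response shape is a mock for illustration only.
    return [(c["document"], c["page"]) for c in response.get("citations", [])]

mock_response = {
    "answer": "The notice period is 30 days.",
    "citations": [
        {"document": "contract_007.pdf", "page": 12},
        {"document": "contract_007.pdf", "page": 14},
    ],
}

for doc, page in extract_citations(mock_response):
    print(f"{doc}, page {page}")
```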
Who is already using it? First results from practice
In the announcement, Google showed three real-world deployment cases that illustrate the benefits of the new architecture:
- Harvey, a legal research platform for law firms, recorded a 3% increase in Recall@20 accuracy on legal benchmarks compared to previous embeddings. The result is more precise citations and more reliable answers for lawyers.
- Nuuly, a clothing rental company from URBN, deployed Gemini Embedding 2 for internal visual search. The system photographs unlabeled garments in the warehouse and searches for them in the catalog. Match@20 accuracy rose from 60% to nearly 87% and overall product identification success from 74% to over 90%.
- Supermemory, a "vector memory database" tool, achieved a 40% increase in Recall@1 accuracy and uses embeddings across the entire retrieval pipeline — from indexing through search to Q&A.
Gemini Embedding 2: technical foundation
The Gemini Embedding 2 model is the first embedding model in the Gemini API that maps text, images, video, audio, and documents into a single semantic space. It supports over 100 languages and is trained using Matryoshka Representation Learning (MRL), which allows shortening the default 3072-dimensional vectors to smaller dimensions (1536 or 768 recommended) while maintaining high accuracy — and significantly lower storage costs.
For developers, a Batch API is also available, offering higher throughput at 50% of the standard embedding price. The model also supports task prefixes: instructions that optimize the embedding for a specific task (for example, task: question answering or task: fact checking) and further increase search accuracy.
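The MRL shortening mentioned above can be demonstrated directly: because Matryoshka-trained models pack the most important information into the leading dimensions, shrinking a vector is just slicing off the head and re-normalizing it to unit length. The 3072 and 768 figures come from the article; the vector here is a synthetic stand-in, not real model output.

```python
import math

def truncate_embedding(vec, dim):
    # MRL embeddings keep their most informative components first,
    # so a shorter vector is the leading slice, re-normalized.
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Synthetic stand-in for a full 3072-dimensional embedding.
full = [((-1) ** i) / (i + 1) for i in range(3072)]
short = truncate_embedding(full, 768)

print(len(short))                           # 768
print(round(sum(x * x for x in short), 6))  # 1.0 — unit length after renormalization
```

Storing 768 floats instead of 3072 per document is where the "significantly lower storage costs" come from: a quarter of the memory and faster similarity computations, at a small accuracy cost.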
The entire solution is available through Gemini API as well as Gemini Enterprise Agent Platform on Google Cloud, which means you can choose between a more open developer approach and a fully enterprise solution with SLA and security guarantees.
What it means for Czech developers and companies
The Czech developer scene has long been at home with Google tools: the Gemini API is available worldwide, including in the Czech Republic and the European Union, without geographic restrictions. The model supports over 100 languages and handles Czech well in practice, although the official documentation does not yet list Czech localization as a separate item.
For Czech startups and companies building RAG systems over their own data (such as customer documentation, internal guidelines, or product catalogs), the new File Search brings several concrete advantages: the need for complex preprocessing of images and multimedia is eliminated, page citations increase the trustworthiness of outputs, and custom metadata makes it easier to scale without losing accuracy.
From the perspective of European regulation, it is important that Google Cloud offers Gemini Enterprise Agent Platform with guaranteed data processing in the EU, which is important for companies that must meet GDPR requirements or the upcoming EU AI Act.
Is Gemini API File Search free?
Gemini API offers a free tier for experimenting with certain limits. For production deployment, you pay according to the volume of processed data — Gemini Embedding 2 has a standard pricing list and Batch API offers a 50% discount compared to the standard rate. Exact prices can be found in the official pricing on the Google AI for Developers pages.
Do I need to know how to program to use File Search?
For direct work with the API, basic programming knowledge is required (Python, JavaScript, or another supported language). However, Google AI Studio also offers a visual interface for rapid prototyping that does not require writing code — ideal for first experiments or validating an idea.
What is the difference between File Search and classic full-text search?
Full-text search looks for exact word or phrase matches. File Search with Gemini Embedding 2 works at the meaning level — it understands context, synonyms, and the visual content of images. Thanks to this, it finds relevant documents even when the exact expression isn't in them at all, and moreover can search images by their visual content, not just by captions.
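The contrast can be made concrete with a toy comparison: a keyword search misses a document that never contains the exact query word, while similarity over embeddings still finds it. The two-dimensional vectors are hand-made placeholders chosen so that "resignation" and "hand in their notice" land close together, as a real semantic model would place them.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

docs = {
    "hr_policy.pdf": "Employees may hand in their notice at any time.",
    "menu.pdf": "Daily lunch specials and seasonal drinks.",
}
# Toy embeddings standing in for model output.
vectors = {
    "hr_policy.pdf": [0.9, 0.1],
    "menu.pdf":      [0.1, 0.9],
}

query = "resignation rules"
query_vec = [0.8, 0.2]  # placeholder embedding of the query

full_text_hits = [name for name, text in docs.items() if "resignation" in text.lower()]
semantic_hits = sorted(vectors, key=lambda n: cosine(query_vec, vectors[n]), reverse=True)

print(full_text_hits)    # [] — the exact word never appears in any document
print(semantic_hits[0])  # hr_policy.pdf — the meaning still matches
```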