How AI‑Powered Apps Actually Work: Practical Guide with Tools, RAG, and Memory
The goal of this article is to show exactly how we make requests, store memory, shape the output, and use tools within AI applications, so that by the end the reader knows the base-level possibilities and limitations of modern AI models.
The easiest way to code agents is the AI SDK – Vercel's provider-agnostic library for TypeScript. Even if your development team uses another language or stack, all the features I'm going to show in this article are available almost anywhere; it just takes more lines of code. I'm showing code in AI SDK syntax because it's the most accessible for readers without prior coding experience. And since Matt Pocock made one of the best AI SDK technical tutorials – huge shoutout to him.
Introduction Into AI Calls
How AI-Powered Apps Work Under the Hood
To truly understand how we use AI interfaces and influence model outcomes, we need to grasp the fundamental process. At its core, an AI-powered application acts as a messenger between the User and an AI Model, facilitated by an AI Provider.
Here's a step-by-step simplified breakdown of the typical chat flow:
- User Initiates Interaction: A User interacts with the AI-powered application by sending input. This input can be text, an image or a document, or some other custom way of interaction, like saving some data.
- Application Prepares Input: The application receives the User's input and prepares the request to the AI Provider. Some apps only add their system prompt; others run complex logic based on the user input, like calling different providers, preparing context, and so on.
- Application Sends Prepared Input to AI Provider: The application makes an API (Application Programming Interface) call to the AI Provider (e.g., OpenAI, Google Gemini, Anthropic). This call includes the prepared text, image data, or other supported inputs.
- AI Provider Tokenizes and Processes Input:
- The AI Provider's infrastructure receives the input and performs tokenization, which is the process of converting the input into a numerical representation that the AI model can understand. This step can be thought of as translating the input into the model's "language".
- For images, the AI Provider utilizes dedicated components to translate the visual data into numerical representations (embeddings) that can be seamlessly integrated with the text tokens.
- All these numerical inputs are then fed into the core AI Model.
- AI Model Performs Inference (Generates Output):
- Inside the AI Model, mathematical calculations determine the probability of the next token in a sequence. This is a continuous loop: the model predicts one token, then uses that prediction to help predict the next, and so on.
- This process continues until a specific "stop token" is generated (signifying the end of the response) or a pre-configured maximum length for the response is reached.
- AI Model Returns Raw Output (Tokens): The AI Model completes its generation and sends back its output, which is still in the form of a sequence of numerical tokens, to the AI Provider.
- AI Provider Detokenizes and Formats Output:
- The AI Provider takes these numerical tokens and performs detokenization, translating them back into human-readable words and sentences.
- It then formats this text response according to its API standards and sends it back to the originating application with some additional data, like the tokens used, the time it took to generate the response, cost, etc.
- Application Processes and Presents Output:
- The application receives the human-or-code-readable response from the AI Provider.
- It then applies its own internal logic – this could involve parsing the text, applying business rules, integrating with other systems, or simply formatting the output nicely for the user interface.
- Finally, the application presents the final response to the User or uses it to trigger other configured actions within the system.
This entire process, from User input to the final displayed output, typically (or hopefully) happens in seconds, creating the seamless experience we expect from AI-powered applications.
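To make this concrete, here is a minimal sketch of the "send prepared input, get formatted output back" part without any library: a raw HTTP call to a provider's chat completions endpoint. OpenAI's API shape is used as the example; other providers differ in the details.
import 'dotenv/config';
async function callProvider(userInput: string) {
  // The application sends the prepared input to the AI Provider
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, // the app's provider API key
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: userInput }, // the prepared user input
      ],
    }),
  });
  // Tokenization, inference, and detokenization happen on the provider's side
  const data = await response.json();
  // The app receives human-readable text plus metadata like token usage
  return { text: data.choices[0].message.content, usage: data.usage };
}
Libraries like the AI SDK wrap exactly this kind of request so you don't have to deal with provider-specific payloads.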
Basic AI Response
Let's start with the most basic example of how to make a call to an AI provider. We'll use the AI SDK to call a Gemini model.
import { generateText } from 'ai';
import { google } from '@ai-sdk/google';
// function to make a call to AI provider
async function answerAsR2D2(
prompt: string // user input as function input
) {
const { text } = await generateText({
model: google('models/gemini-2.5-flash-lite'), // provider and model we use
system: 'you are R2D2 but know english', // system prompt
prompt: prompt, // pass the user input
maxTokens: 128, // add a token limit
});
return text;
}
// userInput is what you type, then we pass it to the function we made above
await answerAsR2D2(userInput);
// here is bunch of logic on how to show the response to you
Now try to use it:
As you can see in the demo above, each interaction with the AI is completely independent. Every time you send a message, the AI treats it as a brand new conversation. This is because AI models are stateless – they have no memory of previous interactions.
But if that's true, how do AI chatbots like ChatGPT and T3.chat seem to remember what you talked about earlier in the conversation?
Creating Conversations
The secret behind AI "conversations" is surprisingly simple: we store the entire conversation history and send it all to the AI model with every new message. The AI doesn't actually remember anything – instead, our application acts as the memory keeper.
Here's how this works in practice:
- User sends first message: We store it and send it to the AI
- AI responds: We store the AI's response alongside the user's message
- User sends second message: We send ALL previous messages PLUS the new message to the AI
- AI responds with context: Because the AI sees the full conversation history, it can respond appropriately
- This pattern continues: Each new interaction includes the complete message history
This approach means that from the AI's perspective, it's seeing the entire conversation context every single time, allowing it to provide relevant, contextual responses.
import { generateText, type ModelMessage } from 'ai';
import { google } from '@ai-sdk/google';
// Function to continue a conversation
async function continueConversation(
messages: ModelMessage[], // all previous messages
newUserMessage: string // the new message from user
) {
// Add the new user message to the conversation
const updatedMessages = [
...messages,
{ role: 'user' as const, content: newUserMessage },
];
// Send the ENTIRE conversation history to AI
const { text } = await generateText({
model: google('models/gemini-2.5-flash-lite'),
system: 'you are R2D2 but know english',
messages: updatedMessages, // Send all messages, not just the new one
maxTokens: 256,
});
// Add AI's response to the conversation
const finalMessages = [
...updatedMessages,
{ role: 'assistant' as const, content: text },
];
return {
response: text,
updatedMessages: finalMessages, // Return updated conversation
};
}
Now let's see this conversation concept in action. The demo below implements exactly what we described above - it stores all messages on the client side and sends the entire conversation history to the AI with each new message:
Key differences you'll notice:
- Memory between messages: R2D2 will remember what you talked about earlier in the conversation
- Conversation history: You can see all previous messages displayed in a chat-like interface
- Growing context: Each API call includes more data as the conversation grows
Message Roles: Shaping the Conversation
When building conversations, messages carry a role that tells the model how to interpret them:
- system: High‑level instructions that set behavior and policy (the “system prompt”).
- user: What the human typed. This is the primary input.
- assistant: What the AI previously replied. This becomes history the model reads on the next turn.
- tool: The result of a tool call, sent back to the model so it can use the tool's output in its next response.
- developer: An optional role some APIs/models (e.g., parts of OpenAI’s Responses API) support for developer instructions. It’s provider‑specific and may be ignored by other models/APIs.
The history you send strongly shapes the next answer. Besides sending the real history, you can also intentionally inject an assistant message to steer tone, format, or constraints. Even if the AI never actually said it, placing a synthetic assistant turn in history can guide the model to “follow up in the same style” on future responses.
import { generateText, type ModelMessage } from 'ai';
import { google } from '@ai-sdk/google';
// Example: steer style/constraints by injecting a synthetic assistant turn
async function replyWithShapedHistory(
userInput: string,
history: ModelMessage[]
) {
const messages: ModelMessage[] = [
{ role: 'system', content: 'You are R2D2, concise and helpful.' },
{ role: 'user', content: 'Hello, how are you?' },
// Synthetic assistant guidance (not actually generated by the model earlier):
{ role: 'assistant', content: 'Beep boop! Whistle-beep! (I refuse to talk to you, Darth Vader!) Bwoo-oop! (You are on the wrong side of the force, Vader!)' },
// Real history + the new user input
...history,
{ role: 'user', content: userInput },
];
const { text } = await generateText({
model: google('models/gemini-2.5-flash-lite'),
messages,
maxTokens: 256,
});
return text;
}
Notes:
- system messages set global behavior; keep them stable and minimal.
- assistant messages are what the model “said” before. Injecting one can nudge style/format.
- developer role is not universal. Some APIs (like OpenAI’s Responses API) support it; others ignore it. Always check your provider’s docs.
For provider‑specific role behavior and conversation management, see OpenAI’s conversation state guide and the chat messages reference.
Input and Output Types
In our previous examples we used text as the input and output type. But, first of all, text is more powerful than you might think; second, modern models are usually multimodal and can process images and files. There are also ways to process audio input and even generate audio and video output with specialized models.
Text
Before we dive into other input and output types, it's crucial to understand that text output is far more powerful than it initially appears. When most people think of AI generating text, they imagine simple written responses like chatbot conversations or essay writing. However, text can be transformed into virtually any digital format you can imagine.
The key insight is that almost everything digital can be represented as text. Code is text. Data formats like CSV and JSON are text. Even complex graphics can be described as text through formats like SVG. This means when an AI generates text, it's actually capable of creating advanced digital content that can be rendered, executed, or transformed into rich interactive experiences.
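For example, a model can return SVG markup – plain text that the browser renders as graphics. A minimal sketch (the prompt and system instructions are just for illustration):
import { generateText } from 'ai';
import { google } from '@ai-sdk/google';
// Ask the model for SVG markup – the "text" output is really a renderable graphic
async function generateSvgIcon(description: string) {
  const { text } = await generateText({
    model: google('models/gemini-2.5-flash-lite'),
    system: 'You output only valid, self-contained SVG markup. No explanations, no code fences.',
    prompt: `Create a simple 64x64 SVG icon of: ${description}`,
  });
  return text; // e.g. '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 64 64">...</svg>'
}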
Structured Output
One of the best examples of how powerful text can be: you can specify and verify the output format. The most common structured output format is JSON – computer-readable key:value pairs.
{
"user": {
"name": "Jennifer",
"city": "New York"
}
}
This way we can make the model output its response in a specific format our application expects, so we can save it to a database, render it in a specific component, etc.
Implementation Example
import { generateObject } from 'ai';
import { google } from '@ai-sdk/google';
import { z } from 'zod';
// Define the shape of data we expect back
const labelSchema = z.object({
category: z.enum(['positive', 'negative', 'neutral']),
confidence: z.number().min(0).max(1),
reasoning: z.string().describe('Explain what made you choose this category'),
});
// Use the schema in the generateObject call
async function labelSentiment(userInput: string) {
const result = await generateObject({
model: google('models/gemini-2.5-flash-lite'),
system: 'You are a sentiment analysis expert. Analyze the sentiment of the given text.',
prompt: userInput,
schema: labelSchema,
});
return result.object;
}
In this example we are using zod to define the structure of the output and validate the model's response to make sure we get what we expect. As you can see, we also use the .describe() method to attach additional information to the schema fields – this is a great way to prompt the model on specific details.
Analyze text sentiment with structured AI output
Since I know the data format I'm going to receive, I created the component that has predefined behavior: if category = negative, then color = red.
Files
Input
Multimodal models can accept files like PDFs and images as input. Examples range from extracting and structuring info from a PDF to UI analysis based on screenshots.
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
// 1) Invoice schema describing the structure we want back from the model
const invoiceSchema = z.object({
invoiceNumber: z.string().describe('The invoice number as shown on the invoice.'),
invoiceDate: z.string().describe('The invoice issue date in ISO or human format as shown.'),
dueDate: z.string().optional().describe('The due date if present.'),
currency: z.string().describe('Currency code or symbol detected in the invoice (e.g., USD, $).'),
seller: z.object({
name: z.string().describe('Seller/company name.'),
address: z.string().optional().describe('Seller address if present.'),
}).describe('Seller details.'),
buyer: z.object({
name: z.string().describe('Buyer/customer name.'),
address: z.string().optional().describe('Buyer address if present.'),
}).describe('Buyer details.'),
items: z.array(
z.object({
description: z.string().describe('Item or service description.'),
quantity: z.number().describe('Quantity as a number.'),
unitPrice: z.number().describe('Unit price as a number.'),
amount: z.number().describe('Line total for the item as a number.'),
})
).describe('Line items on the invoice.'),
subtotal: z.number().describe('Subtotal amount as a number.'),
tax: z.number().optional().describe('Tax amount as a number if present.'),
total: z.number().describe('Total amount as a number.'),
notes: z.string().optional().describe('Additional notes if present.'),
});
// 2) Call the model with the PDF URL as a file input and return structured JSON
async function extractInvoiceFromPdfUrl(pdfUrl: string) {
const { object } = await generateObject({
model: openai('gpt-4o-mini'),
system:
'You will receive a PDF invoice. Extract the fields according to the JSON schema. Use numbers for amounts.',
messages: [
{
role: 'user',
content: [
{
type: 'text',
text: 'Extract all invoice details according to the JSON schema. Use numbers for quantities and amounts. Return best-effort values if labels differ.',
},
{
type: 'file',
data: pdfUrl,
mediaType: 'application/pdf',
},
],
},
],
schema: invoiceSchema,
});
return object; // Fully validated data
}
// 3) Usage
const url = 'https://slicedinvoices.com/pdf/wordpress-pdf-invoice-plugin-sample.pdf';
const data = await extractInvoiceFromPdfUrl(url);
console.log(data);
This demo uses gpt-5-nano, which is good for prototyping and quick testing, but in a real app I would choose a cheaper and faster model like mistral-medium-latest hosted on Groq.
Output
Making the model return a file takes a bit more work: for image generation we need to use special models like gpt-image-1 or gemini-2.0-flash-preview-image-generation. For other file types we have to build a generator / renderer on the application side: for example, we can use the react-pdf library to render a PDF from the output data using React components.
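As a sketch of the image case, the AI SDK exposes an experimental image generation API; the model choice and options below are illustrative, so check your provider's docs for what is supported:
import { experimental_generateImage as generateImage } from 'ai';
import { openai } from '@ai-sdk/openai';
// Generate an image and return raw bytes the app can store or serve
export async function generateIllustration(prompt: string) {
  const { image } = await generateImage({
    model: openai.image('gpt-image-1'),
    prompt,
    size: '1024x1024',
  });
  return image.uint8Array; // also available as image.base64
}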
Audio
There are two common approaches to audio: use specialized models for audio transcription and voice synthesis, or use a realtime connection to the provider's servers so all of this happens on their side and the conversation feels more natural.
Audio Transcription
The staged pipeline: record speech, transcribe to text, process with your normal LLM flow, then optionally synthesize a spoken reply.
Example: transcribe audio, then ask the model with a voice‑ready system prompt.
import { experimental_transcribe as transcribe, generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
type InputAudio = Uint8Array | ArrayBuffer | Buffer | string | URL;
// Transcribe audio received from user to text
export async function transcribeAndAsk(audio: InputAudio) {
const transcript = await transcribe({
model: openai.transcription('gpt-4o-mini-transcribe'),
audio,
// abortSignal: AbortSignal.timeout(10000), // 10 seconds timeout
});
const userText = transcript.text;
// Pass the text to the model to generate a response
const { text: answer } = await generateText({
model: openai('gpt-5-nano'),
system:
'You are an audio chat agent. Be concise and helpful. Everything you output should be easy to synthesize to speech.',
prompt: userText,
});
return { userText, answer };
}
Audio Synthesis
Convert the model’s text into speech using a Text-to-speech (TTS) model.
import { experimental_generateSpeech as generateSpeech } from 'ai';
import { openai } from '@ai-sdk/openai';
// Synthesize text to audio
export async function synthesizeAnswer(
text: string,
opts?: { voice?: string; language?: string }
) {
const audio = await generateSpeech({
model: openai.speech('gpt-4o-mini-tts'), // or 'tts-1' | 'tts-1-hd'
text,
voice: opts?.voice ?? 'alloy', // choose from available voices
language: opts?.language, // e.g., 'en', 'es'
});
return audio.audioData; // Uint8Array
}
Audio Realtime
Realtime sessions (e.g., OpenAI Realtime and others) run transcription, LLM processing, and synthesis on the provider servers over a low‑latency connection, reducing round‑trips.
Pros:
- Lower latency: end‑to‑end speech feels almost immediate for users.
- Natural turn‑taking: continuous streaming (ASR + LLM + TTS) enables smoother voice UX.
- Less plumbing: transcription/synthesis handled server‑side by the provider.
- Built‑in interruptions (barge‑in): realtime SDKs can auto‑pause TTS when the user starts speaking and resume listening immediately; replicating this in a custom staged pipeline is significantly more work.
Cons:
- Provider lock‑in: requires provider‑specific realtime SDKs and session management.
- Less control: harder to inject custom middleware, tool‑calling gates, or prompt sanitization mid‑stream.
- Observability/ops: stream debugging, auditing, and cost/session limits vary by provider.
Tools
Previously we covered structured output, where we defined the shape of the AI's response. If we go even further, we can define functions the AI can use, along with their parameters and return values.
This is one of the most powerful AI features: it lets us easily give the AI access to external tools like web search, database reads or writes, terminal commands – basically anything a computer can do. All this while the agent stays in the loop of execution.
Tools Implementation
Let's start with a simple demo. We'll have two separate tool calls: one for getting data on my blog posts, and another for rendering cards for these blog posts in the chat UI.
Tools have 5 important properties:
- ID: unique identifier for the tool.
- Description: helps model to understand when to call this tool.
- Input Schema: what this tool (or function) expects as input. Some tools don't expect any input, others may have filters or other parameters.
- Execute: the code to execute when the tool is called.
- Response: the response to return when the tool is called.
// 1) Imports: core AI function, provider, schema, and data
import { generateText, stepCountIs, hasToolCall, type ModelMessage } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { posts as allPosts } from '@/data/posts';
// 2) Post types used by our simple call
const postPreviewSchema = z.object({
slug: z.string(),
title: z.string(),
description: z.string(),
image: z.string(),
});
type PostPreview = z.infer<typeof postPreviewSchema>;
// Helper to map a full post into the preview shape returned by the tools
const toPreview = (post: (typeof allPosts)[number]): PostPreview => ({
  slug: post.slug,
  title: post.title,
  description: post.description,
  image: post.image,
});
// 3) Define tools the model can use
const tools = {
get_blog_posts: {
// description: helps model understand when to call this tool
description: "Return Andrey's visible blog posts as an array of post previews.",
// inputSchema: what this tool (or function) expects as input – this one expects nothing
inputSchema: z.object({}).optional(),
// execute: the code to execute when the tool is called
async execute() {
const visible = allPosts.filter((p) => !p.hide);
return visible.map(toPreview);
},
},
render_blog_cards: {
description: 'Return previews for the provided slugs so the UI can render cards.',
inputSchema: z.object({ slugs: z.array(z.string()) }),
async execute({ slugs }: { slugs: string[] }) {
const map = new Map(allPosts.map((p) => [p.slug, p] as const));
const selected = slugs
.map((s) => map.get(s))
.filter((p): p is NonNullable<typeof p> => !!p && !p.hide)
.map(toPreview);
return selected;
},
},
} as const;
// 4) Make a call to the model with tools available
async function askModel(messages: ModelMessage[]) {
const { text } = await generateText({
model: openai('gpt-5-nano'),
system:
"You are an AI chat assistant for Andrey's website.",
messages,
tools, // pass the tools object we defined above
stopWhen: [stepCountIs(5), hasToolCall('render_blog_cards')], // stop when we have 5 steps or when tool render_blog_cards is called
});
return text;
}
// Usage example:
await askModel([{ role: 'user', content: "What are Andrey's blog posts?" }]);
Let's check the demo:
Of course, here I've deliberately overcomplicated the logic to show two different tool calls and how the AI chooses the order in which to call them. In a real application this would be just one tool: get the post data and render it immediately to the user to prevent unnecessary delay.
Retrieval-augmented generation (RAG)
In modern AI applications, RAG means any retrieval step that collects context around a model interaction, followed by augmentation – injecting that context into the request (via system, messages, tools, or files). Retrieval can happen before the first model call (pre‑fetch) or during an agentic loop via tool calls (e.g., docs or web search). It is broader than vector semantic search: semantic search is just one retrieval strategy.
- Page/route context: inject on‑screen content, selected text, and page metadata.
- Recent activity: pull last N user events, open tickets, or recent orders for context.
- Geo/locale: attach local pricing, currency, tax/VAT notes, and language preferences.
- Time window: add goals, deadlines, or campaign context relevant for this period. Or even simpler: get events from last month.
- External systems: fetch CRM/opportunity snapshot, inventory availability, or SLA summary.
- Web/docs lookup: perform doc or web search via tool call during the agent loop.
- Grep/keyword search: simple keyword or pattern matching over documents or logs to find relevant snippets.
- Vector search: semantic similarity over embeddings within a knowledge base when appropriate.
I usually distinguish between two types of RAG: deterministic and agentic. Deterministic RAG assembles context based on explicit rules and usually happens before the model call, while agentic RAG assembles context based on the user input and the available tools.
Deterministic Retrieval and Augmentation
Deterministic RAG is rule-driven, auditable context assembly. No probabilistic search is required: for the same request and state, you always produce the same augmentation.
- Path-based auto-context: if the user mentions a file that matches a pattern, append a specific rule.
- User/tenant policies: append organization rules, privacy constraints, and allowed tools for a given role/plan.
- Entity lookups: given a productId, inject the product data into the prompt.
A simple, deterministic pipeline that composes augmentation blocks based on explicit rules, then calls the model:
import { generateText, type ModelMessage } from 'ai';
import { openai } from '@ai-sdk/openai';
import { getDataAboutUser } from '~/lib/db/service';
export async function respondWithRagSimple(userId: string, userInput: string) {
// 1) Get user info from DB
const userInfo = await getDataAboutUser(userId);
// 2) Build system prompt with user info
const messages: ModelMessage[] = [
{
role: 'system',
content:
'You are a helpful assistant that answers questions about the user. ' +
'Here is some information about the user: ' +
`${JSON.stringify(userInfo)}`,
},
{ role: 'user', content: userInput },
];
// 3) Call the model
const { text } = await generateText({
model: openai('gpt-5-nano'),
messages,
});
return text;
}
Agentic Retrieval
Agentic or non-deterministic RAG is a dynamic, context-aware retrieval process that adapts to the user's query and the available context. It uses tools to gather information, update the context, and refine the response.
- Product advisor (e‑commerce): understands a shopper’s needs from chat, queries internal product DB for availability, price, and specs, vector‑searches reviews for pain points, then recommends 2–3 items.
- Customer support copilot: reads the user’s issue, retrieves similar past tickets and KB articles via vector search, checks account status/SLAs in CRM, and proposes the next best action.
- Sales assistant: summarizes an opportunity by pulling recent emails/meet notes, enriches with CRM health signals and open tasks, and drafts a tailored follow‑up.
- Engineering example: for “Analyze our data fetching patterns”, the model queries the codebase and optionally runs targeted searches, then explains patterns with cited snippets.
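Here is a minimal sketch of this pattern with the AI SDK: the model decides inside the loop when to call a retrieval tool. The search_docs tool and the searchDocs helper are hypothetical placeholders for whatever retrieval you use (keyword, grep, or vector search):
import { generateText, stepCountIs, type ModelMessage } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { searchDocs } from '~/lib/search'; // hypothetical retrieval helper
export async function answerWithAgenticRag(messages: ModelMessage[]) {
  const { text } = await generateText({
    model: openai('gpt-5-nano'),
    system:
      'Answer using the search_docs tool whenever you need project-specific knowledge. Cite the snippets you used.',
    messages,
    tools: {
      search_docs: {
        description: 'Search internal documentation and return the most relevant snippets.',
        inputSchema: z.object({ query: z.string().describe('What to search for') }),
        async execute({ query }: { query: string }) {
          return searchDocs(query); // e.g. an array of { title, url, snippet }
        },
      },
    },
    stopWhen: stepCountIs(5), // allow the model to search, read, and search again for up to 5 steps
  });
  return text;
}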
Cursor-style example: Agent RAG system with parallel tool calls:
Memory Management with Tool Call
I've prepared a demo of a chatbot that can save and update the user's memories. It uses tool calls to save and update memories, and then uses deterministic RAG – attaching the memories to the system prompt.
Think of it like this: on every turn the app ships a small “memory card” along with your message. The backend injects that card into the system prompt so the model treats it as facts. When the model wants to remember something new, it calls a tool to propose a short memory. You approve it on the client, and the app adds it to the next request’s memory card. Simple loop, clear control.
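A minimal sketch of that loop, assuming hypothetical loadMemories and saveMemory helpers backed by your own storage (the demo additionally asks the user to approve each proposed memory on the client):
import { generateText, stepCountIs, type ModelMessage } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';
import { loadMemories, saveMemory } from '~/lib/memories'; // hypothetical storage helpers
export async function chatWithMemory(userId: string, messages: ModelMessage[]) {
  // Deterministic RAG part: attach the stored "memory card" to the system prompt
  const memories = await loadMemories(userId); // e.g. ['Prefers TypeScript', 'Based in Berlin']
  const { text } = await generateText({
    model: openai('gpt-5-nano'),
    system:
      'You are a helpful assistant. Known facts about the user:\n' +
      memories.map((m) => `- ${m}`).join('\n'),
    messages,
    tools: {
      // Tool-call part: the model proposes a new memory when it learns something worth keeping
      save_memory: {
        description: 'Save one short, durable fact about the user for future conversations.',
        inputSchema: z.object({ memory: z.string().describe('One concise fact to remember') }),
        async execute({ memory }: { memory: string }) {
          await saveMemory(userId, memory);
          return { saved: true };
        },
      },
    },
    stopWhen: stepCountIs(3), // let the model call the tool and still produce a final reply
  });
  return text;
}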
Embeddings and Vector Search
An embedding is a list of numbers (a vector) that represents meaning. Similar texts have vectors that point in a similar direction. This lets us compare texts numerically instead of calling an LLM each time. More importantly, it allows us to retrieve relevant snippets of text to inject into model's context, so it can use it to solve user's problem.
Click a word to see its embedding vector. Embeddings turn meaning into numbers.
The main difference in this demo is dimensionality: we are using 8 dimensions, while real embeddings from OpenAI (text-embedding-3-small) have 1536. As you can guess, larger models have even more.
Create Embeddings
Typical flow: pick a provider/model for embeddings, convert your strings into vectors, then compare vectors using a similarity measure like cosine similarity.
import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';
// Turn any string into a vector using a provider's embedding model
export async function createEmbedding(input: string) {
const { embedding } = await embed({
model: openai.embedding('text-embedding-3-small'),
value: input,
});
return embedding; // number[]
}
Semantic Similarity
Semantic similarity is a measure of how similar in meaning two strings are once they are converted to vectors. This enables two major things:
- Classify without an LLM: we can embed both our labels and the target text, then pick the label with the highest similarity – for example, the sentiment labels positive, negative, neutral. This is basically the same as what we did in the Structured Output section, where we asked the LLM to return sentiment as JSON with category and confidence, but the embedding-only approach is usually cheaper and faster, though less flexible (see the sketch after this list).
- More importantly, we can retrieve relevant snippets of text: if we embed the query and the texts (for example, from our documentation or a database), we can pick the texts with the highest similarity. More on this in the next section.
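A minimal sketch of the embedding-only classifier from the first point (the labels are illustrative; in a real app you would embed them once and cache the vectors):
import { embed, embedMany, cosineSimilarity } from 'ai';
import { openai } from '@ai-sdk/openai';
const labels = ['positive', 'negative', 'neutral'];
// Classify text by comparing its embedding against the label embeddings – no LLM call needed
export async function classifySentiment(input: string) {
  const { embeddings: labelEmbeddings } = await embedMany({
    model: openai.embedding('text-embedding-3-small'),
    values: labels,
  });
  const { embedding: inputEmbedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: input,
  });
  const scored = labels.map((label, i) => ({
    label,
    similarity: cosineSimilarity(inputEmbedding, labelEmbeddings[i]),
  }));
  return scored.sort((a, b) => b.similarity - a.similarity)[0]; // label with the highest similarity wins
}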
Vector Search
With many embedded documents in a database (a “vector store”), we can retrieve the most relevant chunks for a query by similarity, then optionally pass those chunks to an LLM.
Examples:
- User: “What’s the limit for file uploads on Free plan?” → model calls vector_search, finds the docs/pricing.md section, replies: “Free plan supports 10MB per file” with a link.
- User: “Why does SSO fail on staging?” → model searches for “SSO staging config”, returns docs/staging-oauth.md and a recent incident note, then summarizes steps and provides URLs.
import { embed, cosineSimilarity } from 'ai';
import { openai } from '@ai-sdk/openai';
interface VectorEntry {
id: string;
title: string;
description: string;
url: string;
embedding: number[];
}
export async function searchVectors(
query: string,
vectorStore: VectorEntry[],
topK: number = 3
) {
// 1) Embed the search query
const { embedding: queryEmbedding } = await embed({
model: openai.embedding('text-embedding-3-small'),
value: query,
});
// 2) Compute similarity for each entry
const scored = vectorStore.map((entry) => ({
id: entry.id,
title: entry.title,
description: entry.description,
url: entry.url,
similarity: cosineSimilarity(entry.embedding, queryEmbedding),
}));
// 3) Sort by highest similarity first
// 4) Return only topK results – in this example we are returning only 3 similar documents
return scored.sort((a, b) => b.similarity - a.similarity).slice(0, topK);
}
// Usage
// const results = await searchVectors('configure OAuth staging', docsVectorStore, 5);
Possible output:
[
{
"id": "docs/staging-oauth.md",
"title": "Configuring OAuth on Staging",
"description": "This guide shows how to configure OAuth on Staging...",
"url": "/docs/staging-oauth",
"similarity": 0.8642
},
{
"id": "docs/oauth-envs.md",
"title": "OAuth Environments and Redirect URIs",
"description": "This guide explains how to set up OAuth in different environments...",
"url": "/docs/oauth-envs",
"similarity": 0.8237
},
{
"id": "incidents/2024-07-sso-outage.md",
"title": "Incident: SSO misconfiguration",
"description": "SSO misconfiguration on staging caused a...",
"url": "/incidents/2024-07-sso-outage",
"similarity": 0.8011
}
]
If you want to dive deeper into vectors and how you can use them, check out Supabase's guide on AI & Vectors, where you can find a lot of examples and detailed explanations of Concepts, Semantic Search, Keyword Search, and more.