Yes. Ollama is free and open source (MIT-licensed). You only pay for the hardware — or the cloud GPU — you run it on. There's no account, API key, or subscription.

Where does Ollama serve its API?

On http://localhost:11434 by default. It's an OpenAI-compatible-ish REST API, so most local apps and libraries can point straight at it once Ollama is running.

How big a model can I run with Ollama?

Roughly, the model's quantized size needs to fit in your VRAM (plus some headroom). 8B Q4 models want ~6 GB, 13–14B want ~10 GB, and 70B needs 40 GB+ or it spills to slow CPU/RAM.

Ollama: The Complete Guide to Running LLMs Locally (2026)

By LocalLLMGear Editorial · Editorial Team · Updated 2026-06-29

We test hardware hands-on and may use AI tools in research — every guide is human-reviewed. Editorial policy.

We may earn a commission from links in this article, at no extra cost to you. Disclosure.

Ollama is the simplest way to run open large language models on your own machine — and once you go past the first ollama run, it’s also a quietly powerful local AI server. This is the deeper reference: the API, Modelfiles, a real UI, and how to match models to your hardware. If you just want the fastest possible start, our step-by-step Ollama quick start gets you chatting in two minutes — come back here when you want to build on top of it.

The 30-second answer: Ollama is a free, open-source tool that downloads and runs open models with one command (ollama run llama3) and exposes a local REST API at http://localhost:11434 so your own apps can use them — fully private, no cloud, no keys.

What Ollama actually is

Two things in one package. First, a CLI for pulling and chatting with models. Second, a background server that exposes those models over a local HTTP API. That second part is what makes Ollama more than a toy: any app on your machine — a chat UI, a code assistant, a Python script — can talk to it without sending a byte to the cloud. The models themselves are open weights (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek and more), and Ollama ships quantized versions by default so they fit on normal GPUs.

Installing Ollama

macOS / Windows: download the installer from ollama.com and run it. It installs a menu-bar/tray app that keeps the server running.
Linux: curl -fsSL https://ollama.com/install.sh | sh. It sets up a systemd service, so the API is available on boot.

After install, confirm it’s alive:

ollama --version
ollama list      # shows installed models (empty at first)

Pulling and running models

ollama run is the do-everything command — it downloads the model if you don’t have it, then drops you into a chat:

ollama run llama3

If you only want to download (for later, or to script with), use pull:

ollama pull qwen2.5:14b
ollama pull mistral
ollama pull gemma2:2b

A model name can carry a tag for size or variant — llama3.1:8b, qwen2.5:32b, phi3:mini. No tag means the default (usually a sensible mid-size, 4-bit quantized build). Other handy commands:

ollama list            # what you have
ollama ps              # what's loaded in memory right now
ollama rm <model>      # delete a model to reclaim disk
ollama show <model>    # license, params, quantization, context length

Using the local API in your apps

Whenever Ollama is running, it serves an HTTP API on port 11434. The simplest call:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain RAG in one sentence.",
  "stream": false
}'

There’s also a /api/chat endpoint for multi-turn conversations, plus an OpenAI-compatible path at /v1/chat/completions — point an existing OpenAI client at http://localhost:11434/v1 with any dummy API key and most code “just works.” The official ollama Python and JavaScript libraries wrap all of this if you’d rather not hand-roll requests. This is the bridge that lets local models power editors, note apps and VS Code assistants without a cloud bill.

Customizing with a Modelfile

A Modelfile is Ollama’s recipe format — think Dockerfile, but for a model. It lets you bake in a system prompt, default parameters, or a fine-tuned/GGUF base into a reusable named model:

FROM llama3
PARAMETER temperature 0.3
SYSTEM "You are a terse senior Go engineer. Answer with code first, prose second."

Build and run it:

ollama create go-helper -f ./Modelfile
ollama run go-helper

You can also point FROM at a downloaded GGUF file to import models that aren’t in the Ollama library — handy for community fine-tunes from Hugging Face.

Adding a UI: Open WebUI

The terminal is great, but a ChatGPT-style interface makes local models feel finished. Open WebUI is the popular, self-hosted front-end; it auto-detects Ollama on the same machine. The quickest path is Docker:

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data --name open-webui \
  ghcr.io/open-webui/open-webui:main

Then open http://localhost:3000 — model picker, chat history, document upload and prompt presets, all running locally against your Ollama server.

Choosing a model size for your VRAM

The single rule that decides everything: the quantized model has to fit in your VRAM, with a little headroom for context. Go over, and Ollama offloads layers to system RAM/CPU — it still runs, just much slower. Rough, approximate guidance:

Popular Ollama models by size (sizes/VRAM are approximate, 4-bit quantized)

GPU / Option	VRAM	Best for
gemma2:2b	2B · ~2 GB	Tiny/fast — laptops, CPU-friendly
llama3.1:8b ★ Our pick	8B · ~6 GB	Great all-rounder — the default pick
qwen2.5:14b	14B · ~10 GB	Stronger reasoning + coding
qwen2.5:32b	32B · ~20 GB	High quality, needs a 24 GB card
llama3.1:70b	70B · ~40 GB+	Top tier — multi-GPU or big unified RAM

Not sure your card can keep up? Our best GPU for local LLMs guide maps VRAM tiers to real models, and on Apple Silicon the unified memory pool changes the math entirely.

Common troubleshooting

Model runs slowly / pegs the CPU: it didn’t fit in VRAM and offloaded. Drop to a smaller model or a heavier quant, or close other GPU apps. Check with ollama ps.
connection refused on port 11434: the server isn’t running. Launch the app (macOS/Windows) or sudo systemctl start ollama (Linux). Run ollama serve manually to see logs.
Out of memory mid-generation: lower the context window, use a smaller model, or reduce how much you’re feeding it at once.
Another app can’t reach Ollama: by default it binds to localhost. To expose it to other machines on your LAN, set OLLAMA_HOST=0.0.0.0 before starting the server (only on networks you trust).
Out of disk: models are big. Prune with ollama rm; ollama list shows sizes.

Where to go next

Ollama is the engine; the rest is matching it to hardware and learning to build on it. Deciding between tools? Read LM Studio vs Ollama. Need the right card first? Start with best GPU for local LLMs or browse the full hardware hub.

And if you want to genuinely understand prompting, RAG and fine-tuning on top of local models — not just run them — a structured course shortcuts months of trial and error:

Learn the fundamentals on DataCamp Ad