What is Ollama and Why Should You Care?
You've probably heard the hype: run AI models locally, no API keys, no cloud costs, fully offline. That's Ollama. It's a tool that makes running large language models (LLMs) on your own hardware dead simple. Install it, pull a model, start chatting. That's it.
This tutorial assumes you're coming in cold—no deep learning background, no AI experience. Just curiosity and some decent hardware. By the end, you'll have a working local LLM and understand what you're actually doing (no black boxes). Whether you're on a laptop, mini PC, or server, the concepts are identical—only the speed varies.
Why Ollama Matters
Privacy: Your prompts stay on your laptop. No OpenAI, no Anthropic, no third party. Your data is yours.
Cost: Free. Download once, run forever. No $20/month subscriptions, no pay-per-token API fees.
Speed: No network latency waiting for cloud API responses. Tokens start streaming the moment generation begins, as fast as your hardware allows.
Offline: Internet down? Your LLM still works. Perfect for training, experimentation, or just using AI when connectivity is spotty.
What You'll Actually Learn
- ✓ Install Ollama on Linux, Mac, or Windows
- ✓ Understand which models fit your hardware (spoiler: lots of them)
- ✓ Run your first LLM and generate text
- ✓ Use Ollama's REST API for programmatic access
- ✓ Monitor performance and understand what's happening under the hood
- ✓ Manage multiple models simultaneously
What This Is NOT
This tutorial is deliberately beginner-focused. We're not covering:
- Model fine-tuning or training (that's Tutorial 2+)
- Deep learning or transformer internals
- Advanced optimization or quantization
- Production deployment or scaling
What we ARE covering: Getting Ollama running and understanding how it works. That's the goal.
Does Your Hardware Cut It? (Spoiler: Yes)
Before we install, let's talk hardware. The good news: Ollama runs on almost any modern CPU. The question isn't "Can I run it?" but "How fast will it run?" That depends on your CPU cores, RAM, and storage speed. Let's figure out what to expect — and pick the right model for your machine.
Minimum Hardware Requirements
- CPU: Any modern multi-core processor (Intel, AMD, Apple Silicon)
- RAM: 8GB minimum (16GB+ recommended for 7B models with headroom)
- Storage: 20GB+ free (models range from ~2GB to ~9GB each)
- OS: Linux (Ubuntu 24.04+), macOS, or Windows with WSL2
If you have these, you can run Ollama. The key variable is speed — which depends on your CPU cores and RAM bandwidth.
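Not sure what your machine has? Here's a minimal Python sketch that reports cores and RAM (the `hardware_summary` helper is invented for this example, and it reads `/proc/meminfo`, so the RAM check is Linux-only):

```python
import os

def hardware_summary():
    """Report CPU core count and total RAM (reads /proc/meminfo, so Linux-only)."""
    cores = os.cpu_count()
    mem_gb = None
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    mem_gb = int(line.split()[1]) / (1024 * 1024)  # kB -> GB
                    break
    except FileNotFoundError:
        pass  # not Linux; check RAM another way
    print(f"CPU cores: {cores}")
    if mem_gb is not None:
        print(f"RAM: {mem_gb:.1f} GB")
        print("7B models fit comfortably" if mem_gb >= 16 else "stick to 1B-3B models")

hardware_summary()
```

On macOS or Windows, check About This Mac or Task Manager instead.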
Model Selection: 2026 Recommended Picks
The model landscape has moved fast. Llama 2 and Neural Chat are gone from the recommended list — there are better, faster, smarter options that run just as well on consumer hardware. Here are the three best starting points:
llama3.2:3b (Meta's compact 3B model, the recommended first pull)
RAM Used: ~3GB
Speed (8-core CPU): 15–30 tokens/sec
Quality: Excellent for its size
Why: Instant responses, works on 8GB RAM, great for first test
qwen2.5:7b (Alibaba's 7B all-rounder)
RAM Used: ~5.5GB
Speed (8-core CPU): 8–15 tokens/sec
Quality: Excellent — punches above weight
Why: 128K context window, great at code + reasoning
mistral:latest (Mistral AI's proven 7B)
RAM Used: ~5GB
Speed (8-core CPU): 8–15 tokens/sec
Quality: Very good for reasoning and code
Why: Battle-tested, huge community, reliable
Ready for more? With 16GB+ RAM you can step up to phi4 (Microsoft's 14B reasoning model that rivals much larger models) or gemma3:12b (Google's efficient series, great quality-per-GB). Both pull the same way: ollama pull phi4 or ollama pull gemma3:12b. Worth the wait if you have the RAM.
Understanding Quantization Tags
When you browse ollama.com/library, you'll see tags
like q4_0, q4_K_M, and q8_0 next to model names. These are quantization levels —
how much the model has been compressed to fit in memory. Here's what they mean:
- q4_0: Basic 4-bit quantization. Smallest common option; quality is fine for everyday chat.
- q4_K_M: Improved 4-bit quantization, and what :latest resolves to. The sweet spot for most users. Same size savings as q4_0 but noticeably better output on complex prompts.
- q8_0: 8-bit quantization. Closest to full quality, but uses roughly twice the RAM of the q4 variants.
# :latest usually resolves to q4_K_M (recommended)
ollama pull qwen2.5:7b
# Explicitly request q8 for better quality (more RAM)
ollama pull qwen2.5:7b-q8_0
# 1B model — tiny and fast for simple tasks
ollama pull llama3.2:1b
For most users, :latest is the right call. Ollama picks a good default. Only specify a quantization tag
if you're optimizing for a specific RAM budget or quality ceiling.
What NOT to Run (On Low RAM)
Some models need more headroom than others. If you're on 8GB total RAM, stick to 3B models:
- 8GB RAM: 1B–3B models comfortably (llama3.2:1b, llama3.2:3b). 7B models are tight — other apps may push you into swap.
- 16GB RAM: 7B models comfortably. Can experiment with 12–14B (phi4, gemma3:12b). Two 7B models loaded simultaneously.
- 32GB RAM: 14B models and below comfortably. Multiple concurrent 7B models. Some 32B quantized models with patience.
If you're on 8GB and a 7B model feels sluggish, try llama3.2:3b instead — it's genuinely impressive
for its size and will be noticeably more responsive.
Hardware Tiers & What to Expect
Speed varies dramatically based on your CPU. Here's what to expect running a 7B model (q4_K_M):
Budget / Older CPU (4 cores, 8 threads): 3–6 tokens/sec (usable, slower)
Mid-Range CPU (8 cores, 16 threads): 8–15 tokens/sec (very good!)
High-End CPU (12+ cores, 24+ threads): 15–25+ tokens/sec (excellent)
Apple Silicon M2/M3: 30–50+ tokens/sec (GPU-class speed via Metal acceleration)
Note: Tested on systems with 16GB+ RAM and NVMe SSD storage
To put it in perspective: 10 tokens/second means a 500-token response (a long paragraph) takes about 50 seconds. That's fast for CPU-only inference. You'll be pleasantly surprised.
Bottom line: start with llama3.2:3b — it's fast, free, and works on any modern machine.
Then pull qwen2.5:7b when you're ready for something more capable. Both will surprise you.
Getting Ollama Running
Ollama installation is genuinely simple. No compilation, no complex setup, no configuration files to fiddle with. Download, run the installer, done. Let's do it.
Prerequisites
- ✓ Linux OS (Ubuntu 24.04+, Fedora, Debian, etc.)
- ✓ 8GB+ RAM (16GB+ recommended for multiple models)
- ✓ 20GB+ free disk space (models are ~4GB each)
- ✓ Stable internet connection (for downloading models once)
- ✓ Ability to run sudo commands (or use your user password)
Step 1 — Download and Install Ollama
Open a terminal and run:
curl -fsSL https://ollama.ai/install.sh | sh
This script:
- Downloads the Ollama binary for your system
- Places it in `/usr/local/bin/` (in your PATH)
- Sets up a systemd service to auto-start on boot
- Creates the ollama user and group
The installation takes 1–2 minutes. You'll see output as it progresses. When it finishes, you're done.
Step 2 — Verify Installation
Check that Ollama is installed and in your PATH:
ollama --version
You should see something like:
ollama version is 0.1.45 (or newer)
Step 3 — The Daemon is Already Running
During installation, the Ollama daemon started automatically in the background. It's listening on http://127.0.0.1:11434.
Do not run ollama serve manually — the port is already in use by the running daemon. You can verify it's active by testing the API:
curl -s http://127.0.0.1:11434/api/tags
If it responds with JSON (like {"models":[]}), the daemon is running. Good to go!
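If you'd rather script that check than eyeball curl output, here's a minimal standard-library Python sketch (the `ollama_is_running` helper is a name invented for this example):

```python
import json
import urllib.request

def ollama_is_running(host="http://127.0.0.1:11434"):
    """Return True if the Ollama daemon answers /api/tags with a model list."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return isinstance(data.get("models"), list)
    except (OSError, ValueError):
        return False

print("daemon up" if ollama_is_running() else "daemon not reachable")
```

Handy as a health check before scripts send real requests.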
Step 4 — Three Ways to Interact with Ollama
Now that the daemon is running, here's how you can use it:
- ollama run <model>: Interactive chat mode. Type prompts, get responses in your terminal. Great for testing and learning.
- REST API (curl/Python/etc): Programmatic access. Send JSON requests to localhost:11434, get JSON responses. Perfect for integrations and scripts.
- Background daemon: The daemon runs automatically on boot and stays running. You don't manage it manually; it just works.
Step 5 — Ready for Your First Model
The daemon is running and the API is responding. Installation is complete.
You're now ready to download and run your first model. Head to the next section to pull Mistral 7B.
Installation Summary
Ollama is installed and running as a background service on http://127.0.0.1:11434.
No configuration needed. No manual daemon management. Just pull a model and start using it.
You can now:
- Pull models with ollama pull <model>
- Run interactive chat with ollama run <model>
- Make API calls to http://127.0.0.1:11434 from scripts
- Start using local AI immediately — no API keys, no cloud, no costs
Download and Run Mistral 7B
Time for the moment of truth. We're going to download Mistral 7B (a smart, fast model) and run our first interactive chat session. This is where the magic happens.
Step 1 — Pull Mistral
"Pulling" a model means downloading it from Ollama's registry and storing it locally. Run:
ollama pull mistral
You'll see output like:
pulling manifest
pulling 418956b73c34... (downloading layer 1)
pulling e1cd8f6a5d4a... (downloading layer 2)
verifying sha256 digest
writing manifest
success
The full model is about 4GB. On a typical internet connection, this takes 3–5 minutes. Grab a coffee.
Step 2 — Run Interactive Chat
Once the download completes, start an interactive chat session:
ollama run mistral
You'll see the prompt appear with the model ready:
>>>
(You can see which version of Mistral loaded by checking what was pulled. Run ollama list to see all installed models and their exact versions.)
Now type a question or statement. Let's try something simple:
>>> What is Ollama?
Watch as the model generates a response in real-time. You'll see tokens appearing one by one. This is your local LLM doing inference right now, on your CPU, with no API calls, no cloud, no tracking.
Step 3 — Try More Prompts
Keep the chat session open and try different questions. Here are some good ones to test:
>>> Explain machine learning in simple terms
>>> Write a Python function that checks if a number is prime
>>> What's a good name for a Discord bot?
>>> Why is the sky blue?
>>> Tell me a joke
Notice:
- Response speed (how quickly tokens stream in)
- Response quality (is it coherent? Accurate?)
- CPU usage (all your cores working hard)
- No waiting for external APIs
Step 4 — Exit the Chat
To exit the interactive session, type:
>>> /bye
Or press Ctrl+D. Either works.
What Actually Happened Here?
Let's be concrete about the workflow:
- Pull: Download the model (~4GB) to ~/.ollama/models
- Load: When you run Mistral, Ollama loads it into RAM (~4GB used)
- Inference: Your prompt goes to the model, which generates tokens one at a time
- Display: Each token appears on your screen as it's generated
- Repeat: You type, the model responds, until you exit
Performance Notes
Interactive mode doesn't display detailed timing information by default (though ollama run mistral --verbose prints stats after each response). Ollama's REST API provides complete timing metrics including generation speed, which you'll explore in the next section.
For now, what you can observe from the interactive experience:
- The response appears token-by-token in real-time
- You can estimate speed by watching how fast tokens appear (roughly 10–15 tokens/sec on mid-range CPUs is typical)
- The model loads into memory the first time you run it (notice a slight delay before responses start)
- Subsequent responses should be faster since the model stays loaded
In the next section, you'll use the REST API to make requests and see exact timing metrics (tokens/sec, total duration, load time, etc.) in JSON responses. That's where you get precise performance data.
Programmatic Access to Your LLM
Interactive chat is fun for testing, but the real power comes from using Ollama's REST API. This lets you send prompts programmatically and get responses as JSON. Perfect for integrating with OpenClaw, scripts, or custom applications.
How It Works
Ollama runs a simple HTTP server on localhost:11434. You send JSON requests, you get JSON responses.
No authentication, no setup. Just HTTP.
This is how you'll integrate Ollama with OpenClaw later. For now, let's test it with curl.
Step 1 — Simple Text Generation (Synchronous)
Open a terminal and run:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Explain quantum computing in 2 sentences",
"stream": false
}'
The -d flag sends JSON data in the request body. The stream: false means "wait for the full
response before returning."
You'll get a JSON response that looks like:
{
"model": "mistral",
"created_at": "2026-02-22T10:30:45.123456Z",
"response": "Quantum computing exploits quantum mechanics (superposition and entanglement) to process data in fundamentally different ways than classical computers. A quantum computer can explore many solutions simultaneously, making certain problems exponentially faster to solve.",
"done": true,
"total_duration": 2500000000,
"load_duration": 300000000,
"prompt_eval_count": 12,
"prompt_eval_duration": 800000000,
"eval_count": 30,
"eval_duration": 1400000000
}
Key fields:
- response: The generated text (what you want)
- done: Whether generation is complete
- eval_count: Number of tokens generated
- eval_duration: Time spent generating (nanoseconds)
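Since eval_duration is in nanoseconds, generation speed is a one-line calculation. A quick sketch (`tokens_per_second` is a helper invented here, fed the sample numbers from the response above):

```python
def tokens_per_second(resp):
    """eval_duration is nanoseconds; convert to tokens generated per second."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Sample numbers from the response above: 30 tokens in 1.4 seconds
sample = {"eval_count": 30, "eval_duration": 1_400_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/sec")  # 21.4 tokens/sec
```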
Step 2 — Parse with jq (Optional But Nice)
The full JSON response includes several fields you may not need, including a context array (used for multi-turn conversations). If you have jq installed, you can extract just what you want:
curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Tell me a joke",
"stream": false
}' | jq -r '.response'
The -r flag means "raw output" (no quotes around the text). You'll just see:
Why don't scientists trust atoms? Because they make up everything!
Or extract key fields without the context clutter:
curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Tell me a joke",
"stream": false
}' | jq '{response, eval_count, eval_duration}'
Much cleaner. Install jq if you don't have it: sudo apt install jq
Note on the context field: The actual API response includes a context array containing token IDs from your prompt and response. This is used for multi-turn conversations (sending context back to maintain conversation history). For single requests, you can safely ignore it or filter it out with jq as shown above.
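Here's how that context round-trip looks from Python, as a minimal standard-library sketch (the `generate` helper is a hypothetical name; the live calls assume the daemon is running on localhost:11434, so they're shown commented out):

```python
import json
import urllib.request

def generate(prompt, context=None, model="mistral",
             url="http://localhost:11434/api/generate"):
    """One /api/generate call; pass the prior turn's context to continue a chat."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if context:
        body["context"] = context
    req = urllib.request.Request(url, data=json.dumps(body).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["response"], data.get("context")

# First turn returns a context; feeding it back gives the model memory:
# answer, ctx = generate("My name is Sam. Remember that.")
# follow_up, _ = generate("What's my name?", context=ctx)
```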
Step 3 — Streaming API (Real-Time Responses)
For longer responses, you might want tokens to stream in real-time (like in interactive chat). Set stream: true:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Write a haiku about programming",
"stream": true
}'
With streaming, you get multiple JSON objects (one per token), streamed line-by-line:
{"model":"mistral","created_at":"...","response":"Code","done":false}
{"model":"mistral","created_at":"...","response":" flows","done":false}
{"model":"mistral","created_at":"...","response":" like","done":false}
...
{"model":"mistral","created_at":"...","response":"","done":true}
Parse each line and print the response field to see tokens appear in real-time. This is how chat interfaces work.
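That line-by-line parsing can be sketched in a few lines of standard-library Python (`stream_generate` is a hypothetical helper; the commented-out usage assumes a running daemon):

```python
import json
import urllib.request

def stream_generate(prompt, model="mistral",
                    url="http://localhost:11434/api/generate"):
    """Yield response tokens as Ollama streams them, one JSON object per line."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk["response"]

# for token in stream_generate("Write a haiku about programming"):
#     print(token, end="", flush=True)
```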
Step 4 — API Parameters
You can pass additional parameters to control generation behavior. Note that sampling parameters go inside an options object; Ollama's /api/generate ignores them at the top level of the request:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Complete this: The future of AI is...",
"stream": false,
"options": {
"temperature": 0.7,
"top_p": 0.9,
"num_predict": 100
}
}'
Parameters explained (you'll dive deeper in Tutorial 2):
- temperature: Randomness (0.0=deterministic, 1.0=balanced, >1.5=creative)
- top_p: Diversity control (lower=more focused)
- num_predict: Max tokens to generate (prevents runaway responses)
For now, the defaults (no parameters) are fine. We'll explore these in Advanced.
Why This Matters for Integration
When you integrate Ollama with OpenClaw, this is what happens under the hood:
- OpenClaw receives a user message from Discord
- It constructs a JSON request to your local Ollama API
- Ollama generates a response
- OpenClaw parses the response and sends it back to Discord
All locally. All on your laptop. All in milliseconds. That's powerful.
Monitoring and Understanding Resource Usage
Now that you have Ollama running, let's look at what's actually happening under the hood. How much memory is it using? How hard is your CPU working? What can you realistically expect?
Monitor While Running
While Ollama is generating text, open another terminal and watch resource usage:
watch -n 1 'free -h'
This updates every second showing your RAM usage. A typical 7B model uses about 5–6GB.
total used free shared buff/cache available
Mem: 31Gi 6.2Gi 22Gi 1.3Gi 3.4Gi 23Gi
This example shows Ollama using ~6GB out of 31GB total—plenty of breathing room. Your system will vary.
For CPU usage, in another terminal:
top -o %CPU
Or use htop for a nicer interface:
htop
During inference, you'll see all your CPU cores at high utilization (70–95%). This is normal and expected. Your CPU is working hard, which is why you get good performance.
Performance Benchmarks by Hardware
Here's what to expect for a single Mistral 7B model on different hardware. All numbers assume 16GB+ RAM and SSD storage:
Budget / Older CPU (4 cores):
Tokens/Sec: 3–6 tokens/sec (slow but usable)
Time to First Token: 1–2 seconds
RAM Used: ~6GB
Mid-Range CPU (8 cores):
Tokens/Sec: 8–15 tokens/sec (very good)
Time to First Token: 500ms–1 second
RAM Used: ~6GB
High-End CPU (12+ cores):
Tokens/Sec: 15–25+ tokens/sec (excellent)
Time to First Token: 300–500ms
RAM Used: ~6GB
Universal:
Model Size on Disk: ~4GB
Model Load Time: 1–3 seconds (SSD-dependent)
CPU Utilization: 70–95% during generation
Where you land on this spectrum depends on your CPU core count and clock speed. The good news: even budget CPUs generate text at usable speeds.
What Affects Performance?
Token generation speed varies based on several factors:
- Prompt Length: Longer prompts take longer to process before generating
- Response Length: More tokens to generate = longer total time (linear relationship)
- System Load: Other apps running = fewer CPU cycles for Ollama
- Model Size: Bigger models (13B+) are much slower on CPU
- SSD Speed: Slow first model load if SSD is bottleneck (yours is fast)
Is 10–15 Tokens/Sec Fast Enough?
Let's put it in perspective:
Short Answer (50 tokens): ~3–5 seconds
Medium Answer (200 tokens): ~13–20 seconds
Long Answer (500 tokens): ~33–50 seconds
Full Essay (1000 tokens): ~65–100 seconds
For comparison:
- OpenAI's API: Also 10–20 tokens/sec, but costs money and requires internet
- Claude/GPT directly: Similar speed with 100x the cost
- A fast human typist: 40–60 words/min ≈ 10 tokens/sec — the model writes about as fast as a person types
So yes, 10–15 tokens/sec is genuinely fast. You're getting good performance locally, for free.
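The arithmetic behind those estimates is simply token count divided by speed; a quick sketch (`response_time` is a helper invented here):

```python
def response_time(tokens, tokens_per_sec):
    """Seconds needed to generate a response of the given length."""
    return tokens / tokens_per_sec

# Reproduce the estimates above for the 10-15 tokens/sec range
for label, n in [("Short", 50), ("Medium", 200), ("Long", 500), ("Essay", 1000)]:
    fast, slow = response_time(n, 15), response_time(n, 10)
    print(f"{label} ({n} tokens): {fast:.0f}-{slow:.0f} s at 10-15 tokens/sec")
```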
Sustained Running (24/7 Concerns)
Most hardware can run Ollama continuously without issues:
- Thermals: CPU inference doesn't generate excessive heat. Modern cooling handles it fine.
- Battery Drain: On laptops with battery, expect 1–2 hours per charge during heavy continuous use.
- Reliability: Modern CPUs are designed for sustained loads. No degradation over time.
- Memory Leaks: Ollama is stable. No memory creep after hours of running.
For occasional or development use on a laptop/desktop, your current hardware is fine. If you want to run Ollama 24/7 at scale with multiple concurrent requests, consider a dedicated server or device with more consistent power delivery.
Download, List, and Switch Between Models
One model is useful. A library of models is powerful. Different models have different strengths — a small 3B model for quick queries, a 7B all-rounder for complex work, a reasoning model when you need to think things through. Ollama makes managing all of them trivial.
Step 1 — Download More Models
Let's pull two more models to build out your library. The first is Qwen2.5:7b — an excellent all-rounder with a massive 128K context window, great for long documents and complex tasks:
ollama pull qwen2.5:7b
And the tiny-but-capable Llama 3.2 1B — useful when you want an instant response and don't need heavy reasoning:
ollama pull llama3.2:1b
The 1B model is only ~1.3GB and loads in seconds. It's your speed tier — great for quick lookups and simple tasks when you don't want to wait for a 7B model to warm up.
Step 2 — List Your Models
See everything you've downloaded:
ollama list
Output:
NAME ID SIZE MODIFIED
qwen2.5:7b 845dbda0ea48 4.7 GB 2 hours ago
mistral:latest f974a74358d6 4.1 GB 3 hours ago
llama3.2:1b baf6a787fdff 1.3 GB 1 hour ago
Three models, ~10GB total. The key is they cover different use cases — speed, quality, and proven reliability.
Step 3 — See What's Actually Running
ollama list shows what's downloaded. ollama ps shows what's currently
loaded in memory — this is the command you want when troubleshooting performance or wondering why
your RAM is full:
ollama ps
NAME              ID              SIZE      PROCESSOR    UNTIL
qwen2.5:7b        845dbda0ea48    5.5 GB    100% CPU     4 minutes from now
mistral:latest    f974a74358d6    5.0 GB    100% CPU     2 minutes from now
Step 4 — Control Keep-Alive (OLLAMA_KEEP_ALIVE)
By default, Ollama unloads a model from RAM 5 minutes after it was last used. This is good for shared machines and 8GB systems. But if you're the only user and have the RAM, keeping models loaded means instant responses with no cold-start delay.
# Keep models loaded for 30 minutes of idle time
export OLLAMA_KEEP_ALIVE=30m
# Keep models loaded indefinitely (until you restart Ollama)
export OLLAMA_KEEP_ALIVE=-1
# Use the default 5-minute unload (default behavior)
export OLLAMA_KEEP_ALIVE=5m
# Apply it permanently (add to ~/.bashrc or ~/.zshrc)
echo 'export OLLAMA_KEEP_ALIVE=30m' >> ~/.bashrc
source ~/.bashrc
Recommendation: on 16GB+ RAM, set it to 30m or longer — a loaded 7B model uses ~5GB and the performance gain is significant. On 8GB RAM, keep the default 5m so the model frees memory when you switch tasks.
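Besides the environment variable, Ollama's API also accepts a per-request keep_alive field that overrides the global setting for that model. A minimal sketch (`with_keep_alive` is a helper invented here):

```python
import json

def with_keep_alive(model, prompt, keep_alive="30m"):
    """Build an /api/generate request body with a per-request keep_alive."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # duration strings like "5m"/"30m"; 0 unloads right after the response,
        # -1 keeps the model loaded indefinitely
        "keep_alive": keep_alive,
    }

print(json.dumps(with_keep_alive("qwen2.5:7b", "hello")))
```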
To apply OLLAMA_KEEP_ALIVE to the Ollama systemd service (so it persists across reboots), add it to the service's environment:
sudo systemctl edit ollama
In the editor that opens, add these lines, then save and exit:
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Finally, reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Step 5 — Switch Between Models
Switch is simple — just run a different model name:
ollama run qwen2.5:7b
You're now in Qwen's interactive session. When done:
>>> /bye
Then switch to the fast tier:
ollama run llama3.2:1b
Step 6 — Run Different Models via API
With the REST API, you can specify which model to use per request. This is how OpenClaw routes different task types to different models:
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen2.5:7b",
"prompt": "Analyze this error and suggest a fix...",
"stream": false
}'
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2:1b",
"prompt": "What does HTTP 429 mean?",
"stream": false
}'
Ollama queues requests and processes them in order. On 16GB+ RAM with keep-alive set, both models can stay loaded and the switch between them is instantaneous — no unload/reload delay.
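A sketch of that per-request routing in Python (`MODEL_FOR_TASK` and `route` are hypothetical names; the commented-out calls assume a running daemon with both models pulled):

```python
import json
import urllib.request

# Hypothetical routing table: tiny model for quick lookups, 7B for heavy work
MODEL_FOR_TASK = {"quick": "llama3.2:1b", "deep": "qwen2.5:7b"}

def route(prompt, task="quick", url="http://localhost:11434/api/generate"):
    """Pick a model per request; Ollama serves whichever model the body names."""
    body = json.dumps({"model": MODEL_FOR_TASK[task],
                       "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# route("What does HTTP 429 mean?", task="quick")
# route("Analyze this error and suggest a fix...", task="deep")
```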
Current Popular Models Worth Trying
All of these are 4-bit quantized by default and pull directly from Ollama's library:
ollama pull llama3.2:1b # 1.3GB — instant speed, simple tasks
ollama pull llama3.2:3b # 2GB — small and surprisingly capable
ollama pull mistral:latest # 4.1GB — proven reasoning + code workhorse
ollama pull qwen2.5:7b # 4.7GB — best all-rounder, 128K context
ollama pull phi4 # 9.1GB — Microsoft's reasoning model, needs 16GB RAM
ollama pull gemma3:12b # 8.1GB — Google's efficient model, needs 16GB RAM
Start with llama3.2:3b and qwen2.5:7b. That covers 90% of use cases.
Add phi4 or gemma3:12b if you have 16GB+ and want higher quality on complex reasoning tasks.
Disk Space Management
Models live in ~/.ollama/models. Check your usage:
du -sh ~/.ollama/models
To remove a model and free disk space:
ollama rm llama3.2:1b
Re-pull anytime with ollama pull. One caveat: ollama rm deletes the model's layers from disk, so pulling it again later is a full download — Ollama's layer caching only saves bandwidth when another installed model still shares layers with the one you're pulling.
Memory Limits (Hardware Dependent)
Ollama loads models on-demand into RAM. Here's what you need to know:
- One 7B model (q4_K_M): ~5–6GB RAM (plus OS overhead, so ~8GB total consumed)
- One 3B model: ~2.5GB RAM — very comfortable on 8GB systems
- Two 7B models loaded: ~12GB — comfortable on 16GB systems
- One 14B model (phi4): ~9GB — needs 16GB, leaves headroom
Ollama gracefully unloads models you're not using when memory gets tight. With OLLAMA_KEEP_ALIVE set, models stay warm as long as you have the RAM to support them.
That's model management: pull, list, switch, check what's loaded with ollama ps, and tune keep-alive to match your hardware. Next: integrating Ollama with OpenClaw.
Where to Go From Here
You have a working local LLM. You understand how to download models, run them interactively, access them via API, and manage multiple models. That's the foundation. Now: what's next?
Option 1: Dive Into Tutorial 2 (Advanced)
Ready to squeeze maximum performance and integrate with OpenClaw? That's the Advanced tutorial.
You'll learn:
- Parameter Tuning: Temperature, top_p, context windows—control how the model behaves
- Benchmarking: Measure and compare model quality and speed objectively
- GPU Detection: Understand when and how to use GPU acceleration (spoiler: you don't need it)
- OpenClaw Integration: Connect Ollama to your OpenClaw bot (local AI brain!)
- Optimization: Squeeze the last bit of performance from your CPU with tuning and hardware-specific optimizations
Advanced is 90 minutes. Do it when you're ready to move beyond "just works" to "optimized."
Option 2: Integrate With OpenClaw Now
If you want to skip advanced tuning and go straight to using Ollama with OpenClaw, you can do that. You have everything you need:
- ✓ Ollama running (listening on localhost:11434)
- ✓ Model loaded (Mistral or Llama 3.2)
- ✓ REST API working (tested with curl)
Point OpenClaw to your local Ollama API instead of OpenAI. Your bot will have a local LLM brain. See the Advanced tutorial for integration details when you're ready.
Option 3: Experiment on Your Own
Try different models. Compare them. Run them with different parameters. Write scripts to benchmark. Build a tiny chatbot using Python + Ollama API.
Some ideas:
- Write a Python script that sends prompts to Ollama and logs response times
- Compare Mistral vs Llama 3.2 on the same prompts (which is smarter?)
- Build a simple CLI chatbot using the ollama Python package or the requests library
- Experiment with the streaming API to build a live chat interface
- Download a 13B model and benchmark it (for curiosity, even if slow)
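The first idea on that list can be sketched in a few lines of standard-library Python (`benchmark` is a hypothetical helper; the commented-out call assumes a running daemon):

```python
import json
import time
import urllib.request

def benchmark(prompts, model="mistral",
              url="http://localhost:11434/api/generate"):
    """Send each prompt and log wall-clock time plus Ollama's own speed fields."""
    for prompt in prompts:
        body = json.dumps({"model": model, "prompt": prompt,
                           "stream": False}).encode()
        req = urllib.request.Request(url, data=body,
                                     headers={"Content-Type": "application/json"})
        start = time.monotonic()
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        elapsed = time.monotonic() - start
        # eval_duration is nanoseconds, so divide by 1e9 for seconds
        tps = data["eval_count"] / (data["eval_duration"] / 1e9)
        print(f"{prompt[:40]!r}: {elapsed:.1f}s wall, {tps:.1f} tokens/sec")

# benchmark(["Why is the sky blue?", "Tell me a joke"])
```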
Option 4: Learn More (Self-Study)
If you want to deepen your understanding:
- Ollama GitHub: source code, issues, community discussions
- Hugging Face Model Hub: thousands of models, descriptions, benchmarks
- YouTube tutorials: video walkthroughs from the community
- Transformers (deep dive): if you want to understand the math behind LLMs
Common Next Questions
Q: Can I run this in the background permanently?
A: Yes! Ollama already auto-starts as a systemd service. It'll run on boot and keep running.
Q: Can I access Ollama from other machines (not localhost)?
A: Not by default (security). You'd need to expose the API with an nginx proxy or SSH tunnel (advanced).
Q: What if I want to upgrade to a bigger model later?
A: You can, but 13B+ models are significantly slower on CPU. Consider upgrading hardware to GPU.
Q: Is there a web UI for Ollama?
A: Not built-in, but the community has built web interfaces. Or use the CLI/API directly.
Q: Can I use Ollama on Windows/Mac?
A: Yes! Ollama has installers for all platforms. Same concepts apply.
You're Ready
You've completed Ollama Basics. You understand:
- ✓ What Ollama is and why it matters
- ✓ How to install and verify it works
- ✓ How to download and run models
- ✓ How to use the REST API programmatically
- ✓ What performance to expect from modern CPUs
- ✓ How to manage multiple models
That's a solid foundation. From here, you can:
- → Go to Tutorial 2 (Advanced) for optimization and integration
- → Integrate with OpenClaw directly (you have all the pieces)
- → Experiment and build on your own
Whatever you choose, you're now part of the local AI revolution. Your data is yours. Your LLM is yours. No cloud, no subscriptions, just you and your hardware. That's powerful. Enjoy!