What is Ollama and Why Should You Care?
You've probably heard the hype: run AI models locally, no API keys, no cloud costs, fully offline. That's Ollama. It's a tool that makes running large language models (LLMs) on your own hardware dead simple. Install it, pull a model, start chatting. That's it.
This tutorial assumes you're coming in cold—no deep learning background, no AI experience. Just curiosity and some decent hardware. By the end, you'll have a working local LLM and understand what you're actually doing (no black boxes). Whether you're on a laptop, mini PC, or server, the concepts are identical—only the speed varies.
Why Ollama Matters
Privacy: Your prompts stay on your laptop. No OpenAI, no Anthropic, no third party. Your data is yours.
Cost: Free. Download once, run forever. No $20/month subscriptions, no pay-per-token API fees.
Speed: No network latency waiting for cloud API responses. Tokens start streaming the moment generation begins, as fast as your hardware allows.
Offline: Internet down? Your LLM still works. Perfect for training, experimentation, or just using AI when connectivity is spotty.
What You'll Actually Learn
- ✓ Install Ollama on Linux, Mac, or Windows
- ✓ Understand which models fit your hardware (spoiler: lots of them)
- ✓ Run your first LLM and generate text
- ✓ Use Ollama's REST API for programmatic access
- ✓ Monitor performance and understand what's happening under the hood
- ✓ Manage multiple models simultaneously
What This Is NOT
This tutorial is deliberately beginner-focused. We're not covering:
- Model fine-tuning or training (that's Tutorial 2+)
- Deep learning or transformer internals
- Advanced optimization or quantization
- Production deployment or scaling
What we ARE covering: Getting Ollama running and understanding how it works. That's the goal.
Does Your Hardware Cut It? (Spoiler: Yes)
Before we install, let's talk hardware. The good news: Ollama runs on almost any modern CPU. The question isn't "Can I run it?" but "How fast will it run?" That depends on your CPU cores, RAM, and storage speed. Let's figure out what to expect — and pick the right model for your machine.
Minimum Hardware Requirements
- CPU: Any modern multi-core processor (Intel, AMD, Apple Silicon)
- RAM: 8GB minimum (16GB+ recommended for 7B models with headroom)
- Storage: 20GB+ free (models range from ~2GB to ~9GB each)
- OS: Linux (Ubuntu 24.04+), macOS, or Windows with WSL2
If you have these, you can run Ollama. The key variable is speed — which depends on your CPU cores and RAM bandwidth.
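Not sure what your machine has? Here's a minimal Python sketch that reports cores and RAM (the `hardware_summary` helper is invented for this example, and it reads `/proc/meminfo`, so the RAM check is Linux-only):

```python
import os

def hardware_summary():
    """Report CPU core count and total RAM (reads /proc/meminfo, so Linux-only)."""
    cores = os.cpu_count()
    mem_gb = None
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    mem_gb = int(line.split()[1]) / (1024 * 1024)  # kB -> GB
                    break
    except FileNotFoundError:
        pass  # not Linux; check RAM another way
    print(f"CPU cores: {cores}")
    if mem_gb is not None:
        print(f"RAM: {mem_gb:.1f} GB")
        print("7B models fit comfortably" if mem_gb >= 16 else "stick to 1B-3B models")

hardware_summary()
```

On macOS or Windows, check About This Mac or Task Manager instead.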
Model Selection: 2026 Recommended Picks
The model landscape has moved fast. Llama 2 and Neural Chat are gone from the recommended list — there are better, faster, smarter options that run just as well on consumer hardware. Here are the three best starting points:
llama3.2:3b (Meta's compact 3B model, the recommended first pull)
RAM Used: ~3GB
Speed (8-core CPU): 15–30 tokens/sec
Quality: Excellent for its size
Why: Instant responses, works on 8GB RAM, great for first test
qwen2.5:7b (Alibaba's 7B all-rounder)
RAM Used: ~5.5GB
Speed (8-core CPU): 8–15 tokens/sec
Quality: Excellent — punches above weight
Why: 128K context window, great at code + reasoning
mistral:latest (Mistral AI's proven 7B)
RAM Used: ~5GB
Speed (8-core CPU): 8–15 tokens/sec
Quality: Very good for reasoning and code
Why: Battle-tested, huge community, reliable
Ready for more? With 16GB+ RAM you can step up to phi4 (Microsoft's 14B reasoning model that rivals much larger models) or gemma3:12b (Google's efficient series, great quality-per-GB). Both pull the same way: ollama pull phi4 or ollama pull gemma3:12b. Worth the wait if you have the RAM.
Understanding Quantization Tags
When you browse ollama.com/library, you'll see tags
like q4_0, q4_K_M, and q8_0 next to model names. These are quantization levels —
how much the model has been compressed to fit in memory. Here's what they mean:
- q4_0: Basic 4-bit quantization. Smallest common option; quality is fine for everyday chat.
- q4_K_M: Improved 4-bit quantization, and what :latest resolves to. The sweet spot for most users. Same size savings as q4_0 but noticeably better output on complex prompts.
- q8_0: 8-bit quantization. Closest to full quality, but uses roughly twice the RAM of the q4 variants.
# :latest usually resolves to q4_K_M (recommended)
ollama pull qwen2.5:7b
# Explicitly request q8 for better quality (more RAM)
ollama pull qwen2.5:7b-q8_0
# 1B model — tiny and fast for simple tasks
ollama pull llama3.2:1b
For most users, :latest is the right call. Ollama picks a good default. Only specify a quantization tag
if you're optimizing for a specific RAM budget or quality ceiling.
What NOT to Run (On Low RAM)
Some models need more headroom than others. If you're on 8GB total RAM, stick to 3B models:
- 8GB RAM: 1B–3B models comfortably (llama3.2:1b, llama3.2:3b). 7B models are tight — other apps may push you into swap.
- 16GB RAM: 7B models comfortably. Can experiment with 12–14B (phi4, gemma3:12b). Two 7B models loaded simultaneously.
- 32GB RAM: 14B models and below comfortably. Multiple concurrent 7B models. Some 32B quantized models with patience.
If you're on 8GB and a 7B model feels sluggish, try llama3.2:3b instead — it's genuinely impressive
for its size and will be noticeably more responsive.
Hardware Tiers & What to Expect
Speed varies dramatically based on your CPU. Here's what to expect running a 7B model (q4_K_M):
Budget / Older CPU (4 cores, 8 threads): 3–6 tokens/sec (usable, slower)
Mid-Range CPU (8 cores, 16 threads): 8–15 tokens/sec (very good!)
High-End CPU (12+ cores, 24+ threads): 15–25+ tokens/sec (excellent)
Apple Silicon M2/M3: 30–50+ tokens/sec (GPU-class speed via Metal acceleration)
Note: Tested on systems with 16GB+ RAM and NVMe SSD storage
To put it in perspective: 10 tokens/second means a 500-token response (a long paragraph) takes about 50 seconds. That's fast for CPU-only inference. You'll be pleasantly surprised.
Bottom line: start with llama3.2:3b — it's fast, free, and works on any modern machine.
Then pull qwen2.5:7b when you're ready for something more capable. Both will surprise you.
Getting Ollama Running
Ollama installation is genuinely simple. No compilation, no complex setup, no configuration files to fiddle with. Download, run the installer, done. Let's do it.
Prerequisites
- ✓ Linux OS (Ubuntu 24.04+, Fedora, Debian, etc.)
- ✓ 8GB+ RAM (16GB+ recommended for multiple models)
- ✓ 20GB+ free disk space (models are ~4GB each)
- ✓ Stable internet connection (for downloading models once)
- ✓ Ability to run sudo commands (or use your user password)
Step 1 — Download and Install Ollama
Open a terminal and run:
curl -fsSL https://ollama.ai/install.sh | sh
This script:
- Downloads the Ollama binary for your system
- Places it in `/usr/local/bin/` (in your PATH)
- Sets up a systemd service to auto-start on boot
- Creates the ollama user and group
The installation takes 1–2 minutes. You'll see output as it progresses. When it finishes, you're done.
Step 2 — Verify Installation
Check that Ollama is installed and in your PATH:
ollama --version
You should see something like:
ollama version is 0.1.45 (or newer)
Step 3 — The Daemon is Already Running
During installation, the Ollama daemon started automatically in the background. It's listening on http://127.0.0.1:11434.
Do not run ollama serve manually — the port is already in use by the running daemon. You can verify it's active by testing the API:
curl -s http://127.0.0.1:11434/api/tags
If it responds with JSON (like {"models":[]}), the daemon is running. Good to go!
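If you'd rather script that check than eyeball curl output, here's a minimal standard-library Python sketch (the `ollama_is_running` helper is a name invented for this example):

```python
import json
import urllib.request

def ollama_is_running(host="http://127.0.0.1:11434"):
    """Return True if the Ollama daemon answers /api/tags with a model list."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=2) as resp:
            data = json.load(resp)
        return isinstance(data.get("models"), list)
    except (OSError, ValueError):
        return False

print("daemon up" if ollama_is_running() else "daemon not reachable")
```

Handy as a health check before scripts send real requests.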
Step 4 — Three Ways to Interact with Ollama
Now that the daemon is running, here's how you can use it:
- ollama run <model>: Interactive chat mode. Type prompts, get responses in your terminal. Great for testing and learning.
- REST API (curl/Python/etc): Programmatic access. Send JSON requests to localhost:11434, get JSON responses. Perfect for integrations and scripts.
- Background daemon: The daemon runs automatically on boot and stays running. You don't manage it manually; it just works.
Step 5 — Ready for Your First Model
The daemon is running and the API is responding. Installation is complete.
You're now ready to download and run your first model. Head to the next section to pull Mistral 7B.
Installation Summary
Ollama is installed and running as a background service on http://127.0.0.1:11434.
No configuration needed. No manual daemon management. Just pull a model and start using it.
You can now:
- Pull models with ollama pull <model>
- Run interactive chat with ollama run <model>
- Make API calls to http://127.0.0.1:11434 from scripts
- Start using local AI immediately — no API keys, no cloud, no costs
Download and Run Mistral 7B
Time for the moment of truth. We're going to download Mistral 7B (a smart, fast model) and run our first interactive chat session. This is where the magic happens.
Step 1 — Pull Mistral
"Pulling" a model means downloading it from Ollama's registry and storing it locally. Run:
ollama pull mistral
You'll see output like:
pulling manifest
pulling 418956b73c34... (downloading layer 1)
pulling e1cd8f6a5d4a... (downloading layer 2)
verifying sha256 digest
writing manifest
success
The full model is about 4GB. On a typical internet connection, this takes 3–5 minutes. Grab a coffee.
Step 2 — Run Interactive Chat
Once the download completes, start an interactive chat session:
ollama run mistral
You'll see the prompt appear with the model ready:
>>>
(You can see which version of Mistral loaded by checking what was pulled. Run ollama list to see all installed models and their exact versions.)
Now type a question or statement. Let's try something simple:
>>> What is Ollama?
Watch as the model generates a response in real-time. You'll see tokens appearing one by one. This is your local LLM doing inference right now, on your CPU, with no API calls, no cloud, no tracking.
Step 3 — Try More Prompts
Keep the chat session open and try different questions. Here are some good ones to test:
>>> Explain machine learning in simple terms
>>> Write a Python function that checks if a number is prime
>>> What's a good name for a Discord bot?
>>> Why is the sky blue?
>>> Tell me a joke
Notice:
- Response speed (how quickly tokens stream in)
- Response quality (is it coherent? Accurate?)
- CPU usage (all your cores working hard)
- No waiting for external APIs
Step 4 — Exit the Chat
To exit the interactive session, type:
>>> /bye
Or press Ctrl+D. Either works.
What Actually Happened Here?
Let's be concrete about the workflow:
- Pull: Download the model (~4GB) to ~/.ollama/models
- Load: When you run Mistral, Ollama loads it into RAM (~4GB used)
- Inference: Your prompt goes to the model, which generates tokens one at a time
- Display: Each token appears on your screen as it's generated
- Repeat: You type, the model responds, until you exit
Performance Notes
Interactive mode doesn't display detailed timing information by default (though ollama run mistral --verbose prints stats after each response). Ollama's REST API provides complete timing metrics including generation speed, which you'll explore in the next section.
For now, what you can observe from the interactive experience:
- The response appears token-by-token in real-time
- You can estimate speed by watching how fast tokens appear (roughly 10–15 tokens/sec on mid-range CPUs is typical)
- The model loads into memory the first time you run it (notice a slight delay before responses start)
- Subsequent responses should be faster since the model stays loaded
In the next section, you'll use the REST API to make requests and see exact timing metrics (tokens/sec, total duration, load time, etc.) in JSON responses. That's where you get precise performance data.
Programmatic Access to Your LLM
Interactive chat is fun for testing, but the real power comes from using Ollama's REST API. This lets you send prompts programmatically and get responses as JSON. Perfect for integrating with OpenClaw, scripts, or custom applications.
How It Works
Ollama runs a simple HTTP server on localhost:11434. You send JSON requests, you get JSON responses.
No authentication, no setup. Just HTTP.
This is how you'll integrate Ollama with OpenClaw later. For now, let's test it with curl.
Step 1 — Simple Text Generation (Synchronous)
Open a terminal and run:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Explain quantum computing in 2 sentences",
"stream": false
}'
The -d flag sends JSON data in the request body. The stream: false means "wait for the full
response before returning."
You'll get a JSON response that looks like:
{
"model": "mistral",
"created_at": "2026-02-22T10:30:45.123456Z",
"response": "Quantum computing exploits quantum mechanics (superposition and entanglement) to process data in fundamentally different ways than classical computers. A quantum computer can explore many solutions simultaneously, making certain problems exponentially faster to solve.",
"done": true,
"total_duration": 2500000000,
"load_duration": 300000000,
"prompt_eval_count": 12,
"prompt_eval_duration": 800000000,
"eval_count": 30,
"eval_duration": 1400000000
}
Key fields:
- response: The generated text (what you want)
- done: Whether generation is complete
- eval_count: Number of tokens generated
- eval_duration: Time spent generating (nanoseconds)
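Since eval_duration is in nanoseconds, generation speed is a one-line calculation. A quick sketch (`tokens_per_second` is a helper invented here, fed the sample numbers from the response above):

```python
def tokens_per_second(resp):
    """eval_duration is nanoseconds; convert to tokens generated per second."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Sample numbers from the response above: 30 tokens in 1.4 seconds
sample = {"eval_count": 30, "eval_duration": 1_400_000_000}
print(f"{tokens_per_second(sample):.1f} tokens/sec")  # 21.4 tokens/sec
```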
Step 2 — Parse with jq (Optional But Nice)
The full JSON response includes several fields you may not need, including a context array (used for multi-turn conversations). If you have jq installed, you can extract just what you want:
curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Tell me a joke",
"stream": false
}' | jq -r '.response'
The -r flag means "raw output" (no quotes around the text). You'll just see:
Why don't scientists trust atoms? Because they make up everything!
Or extract key fields without the context clutter:
curl -s http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Tell me a joke",
"stream": false
}' | jq '{response, eval_count, eval_duration}'
Much cleaner. Install jq if you don't have it: sudo apt install jq
Note on the context field: The actual API response includes a context array containing token IDs from your prompt and response. This is used for multi-turn conversations (sending context back to maintain conversation history). For single requests, you can safely ignore it or filter it out with jq as shown above.
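Here's how that context round-trip looks from Python, as a minimal standard-library sketch (the `generate` helper is a hypothetical name; the live calls assume the daemon is running on localhost:11434, so they're shown commented out):

```python
import json
import urllib.request

def generate(prompt, context=None, model="mistral",
             url="http://localhost:11434/api/generate"):
    """One /api/generate call; pass the prior turn's context to continue a chat."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if context:
        body["context"] = context
    req = urllib.request.Request(url, data=json.dumps(body).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["response"], data.get("context")

# First turn returns a context; feeding it back gives the model memory:
# answer, ctx = generate("My name is Sam. Remember that.")
# follow_up, _ = generate("What's my name?", context=ctx)
```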
Step 3 — Streaming API (Real-Time Responses)
For longer responses, you might want tokens to stream in real-time (like in interactive chat). Set stream: true:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Write a haiku about programming",
"stream": true
}'
With streaming, you get multiple JSON objects (one per token), streamed line-by-line:
{"model":"mistral","created_at":"...","response":"Code","done":false}
{"model":"mistral","created_at":"...","response":" flows","done":false}
{"model":"mistral","created_at":"...","response":" like","done":false}
...
{"model":"mistral","created_at":"...","response":"","done":true}
Parse each line and print the response field to see tokens appear in real-time. This is how chat interfaces work.
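That line-by-line parsing can be sketched in a few lines of standard-library Python (`stream_generate` is a hypothetical helper; the commented-out usage assumes a running daemon):

```python
import json
import urllib.request

def stream_generate(prompt, model="mistral",
                    url="http://localhost:11434/api/generate"):
    """Yield response tokens as Ollama streams them, one JSON object per line."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk["response"]

# for token in stream_generate("Write a haiku about programming"):
#     print(token, end="", flush=True)
```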
Step 4 — API Parameters
You can pass additional parameters to control generation behavior. Note that sampling parameters go inside an options object; Ollama's /api/generate ignores them at the top level of the request:
curl http://localhost:11434/api/generate \
-d '{
"model": "mistral",
"prompt": "Complete this: The future of AI is...",
"stream": false,
"options": {
"temperature": 0.7,
"top_p": 0.9,
"num_predict": 100
}
}'
Parameters explained (you'll dive deeper in Tutorial 2):
- temperature: Randomness (0.0=deterministic, 1.0=balanced, >1.5=creative)
- top_p: Diversity control (lower=more focused)
- num_predict: Max tokens to generate (prevents runaway responses)
For now, the defaults (no parameters) are fine. We'll explore these in Advanced.
Why This Matters for Integration
When you integrate Ollama with OpenClaw, this is what happens under the hood:
- OpenClaw receives a user message from Discord
- It constructs a JSON request to your local Ollama API
- Ollama generates a response
- OpenClaw parses the response and sends it back to Discord
All locally. All on your laptop. All in milliseconds. That's powerful.
Monitoring and Understanding Resource Usage
Now that you have Ollama running, let's look at what's actually happening under the hood. How much memory is it using? How hard is your CPU working? What can you realistically expect?
Monitor While Running
While Ollama is generating text, open another terminal and watch resource usage:
watch -n 1 'free -h'
This updates every second showing your RAM usage. A typical 7B model uses about 5–6GB.
total used free shared buff/cache available
Mem: 31Gi 6.2Gi 22Gi 1.3Gi 3.4Gi 23Gi
This example shows Ollama using ~6GB out of 31GB total—plenty of breathing room. Your system will vary.
For CPU usage, in another terminal:
top -o %CPU
Or use htop for a nicer interface:
htop
During inference, you'll see all your CPU cores at high utilization (70–95%). This is normal and expected. Your CPU is working hard, which is why you get good performance.
Performance Benchmarks by Hardware
Here's what to expect for a single Mistral 7B model on different hardware. All numbers assume 16GB+ RAM and SSD storage:
Budget / Older CPU (4 cores):
Tokens/Sec: 3–6 tokens/sec (slow but usable)
Time to First Token: 1–2 seconds
RAM Used: ~6GB
Mid-Range CPU (8 cores):
Tokens/Sec: 8–15 tokens/sec (very good)
Time to First Token: 500ms–1 second
RAM Used: ~6GB
High-End CPU (12+ cores):
Tokens/Sec: 15–25+ tokens/sec (excellent)
Time to First Token: 300–500ms
RAM Used: ~6GB
Universal:
Model Size on Disk: ~4GB
Model Load Time: 1–3 seconds (SSD-dependent)
CPU Utilization: 70–95% during generation
Where you land on this spectrum depends on your CPU core count and clock speed. The good news: even budget CPUs generate text at usable speeds.
What Affects Performance?
Token generation speed varies based on several factors:
- Prompt Length: Longer prompts take longer to process before generating
- Response Length: More tokens to generate = longer total time (linear relationship)
- System Load: Other apps running = fewer CPU cycles for Ollama
- Model Size: Bigger models (13B+) are much slower on CPU
- SSD Speed: Slow first model load if SSD is bottleneck (yours is fast)
Is 10–15 Tokens/Sec Fast Enough?
Let's put it in perspective:
Short Answer (50 tokens): ~3–5 seconds
Medium Answer (200 tokens): ~13–20 seconds
Long Answer (500 tokens): ~33–50 seconds
Full Essay (1000 tokens): ~65–100 seconds
For comparison:
- OpenAI's API: Also 10–20 tokens/sec, but costs money and requires internet
- Claude/GPT directly: Similar speed with 100x the cost
- A fast human typist: 40–60 words/min ≈ 10 tokens/sec — the model writes about as fast as a person types
So yes, 10–15 tokens/sec is genuinely fast. You're getting good performance locally, for free.
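The arithmetic behind those estimates is simply token count divided by speed; a quick sketch (`response_time` is a helper invented here):

```python
def response_time(tokens, tokens_per_sec):
    """Seconds needed to generate a response of the given length."""
    return tokens / tokens_per_sec

# Reproduce the estimates above for the 10-15 tokens/sec range
for label, n in [("Short", 50), ("Medium", 200), ("Long", 500), ("Essay", 1000)]:
    fast, slow = response_time(n, 15), response_time(n, 10)
    print(f"{label} ({n} tokens): {fast:.0f}-{slow:.0f} s at 10-15 tokens/sec")
```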
Sustained Running (24/7 Concerns)
Most hardware can run Ollama continuously without issues:
- Thermals: CPU inference doesn't generate excessive heat. Modern cooling handles it fine.
- Battery Drain: On laptops with battery, expect 1–2 hours per charge during heavy continuous use.
- Reliability: Modern CPUs are designed for sustained loads. No degradation over time.
- Memory Leaks: Ollama is stable. No memory creep after hours of running.
For occasional or development use on a laptop/desktop, your current hardware is fine. If you want to run Ollama 24/7 at scale with multiple concurrent requests, consider a dedicated server or device with more consistent power delivery.
Download, List, and Switch Between Models
One model is useful. A library of models is powerful. Different models have different strengths — a small 3B model for quick queries, a 7B all-rounder for complex work, a reasoning model when you need to think things through. Ollama makes managing all of them trivial.
Step 1 — Download More Models
Let's pull two more models to build out your library. The first is Qwen2.5:7b — an excellent all-rounder with a massive 128K context window, great for long documents and complex tasks:
ollama pull qwen2.5:7b
And the tiny-but-capable Llama 3.2 1B — useful when you want an instant response and don't need heavy reasoning:
ollama pull llama3.2:1b
The 1B model is only ~1.3GB and loads in seconds. It's your speed tier — great for quick lookups and simple tasks when you don't want to wait for a 7B model to warm up.
Step 2 — List Your Models
See everything you've downloaded:
ollama list
Output:
NAME ID SIZE MODIFIED
qwen2.5:7b 845dbda0ea48 4.7 GB 2 hours ago
mistral:latest f974a74358d6 4.1 GB 3 hours ago
llama3.2:1b baf6a787fdff 1.3 GB 1 hour ago
Three models, ~10GB total. The key is they cover different use cases — speed, quality, and proven reliability.
Step 3 — See What's Actually Running
ollama list shows what's downloaded. ollama ps shows what's currently
loaded in memory — this is the command you want when troubleshooting performance or wondering why
your RAM is full:
ollama ps
NAME              ID              SIZE      PROCESSOR    UNTIL
qwen2.5:7b        845dbda0ea48    5.5 GB    100% CPU     4 minutes from now
mistral:latest    f974a74358d6    5.0 GB    100% CPU     2 minutes from now
Step 4 — Control Keep-Alive (OLLAMA_KEEP_ALIVE)
By default, Ollama unloads a model from RAM 5 minutes after it was last used. This is good for shared machines and 8GB systems. But if you're the only user and have the RAM, keeping models loaded means instant responses with no cold-start delay.
# Keep models loaded for 30 minutes of idle time
export OLLAMA_KEEP_ALIVE=30m
# Keep models loaded indefinitely (until you restart Ollama)
export OLLAMA_KEEP_ALIVE=-1
# Use the default 5-minute unload (default behavior)
export OLLAMA_KEEP_ALIVE=5m
# Apply it permanently (add to ~/.bashrc or ~/.zshrc)
echo 'export OLLAMA_KEEP_ALIVE=30m' >> ~/.bashrc
source ~/.bashrc
Recommendation: on 16GB+ RAM, set it to 30m or longer — a loaded 7B model uses ~5GB and the performance gain is significant. On 8GB RAM, keep the default 5m so the model frees memory when you switch tasks.
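Besides the environment variable, Ollama's API also accepts a per-request keep_alive field that overrides the global setting for that model. A minimal sketch (`with_keep_alive` is a helper invented here):

```python
import json

def with_keep_alive(model, prompt, keep_alive="30m"):
    """Build an /api/generate request body with a per-request keep_alive."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # duration strings like "5m"/"30m"; 0 unloads right after the response,
        # -1 keeps the model loaded indefinitely
        "keep_alive": keep_alive,
    }

print(json.dumps(with_keep_alive("qwen2.5:7b", "hello")))
```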
To apply OLLAMA_KEEP_ALIVE to the Ollama systemd service (so it persists across reboots), add it to the service's environment:
sudo systemctl edit ollama
In the editor that opens, add these lines, then save and exit:
[Service]
Environment="OLLAMA_KEEP_ALIVE=30m"
Finally, reload and restart the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Step 5 — Switch Between Models
Switch is simple — just run a different model name:
ollama run qwen2.5:7b
You're now in Qwen's interactive session. When done:
>>> /bye
Then switch to the fast tier:
ollama run llama3.2:1b
Step 6 — Run Different Models via API
With the REST API, you can specify which model to use per request. This is how OpenClaw routes different task types to different models:
curl http://localhost:11434/api/generate \
-d '{
"model": "qwen2.5:7b",
"prompt": "Analyze this error and suggest a fix...",
"stream": false
}'
curl http://localhost:11434/api/generate \
-d '{
"model": "llama3.2:1b",
"prompt": "What does HTTP 429 mean?",
"stream": false
}'
Ollama queues requests and processes them in order. On 16GB+ RAM with keep-alive set, both models can stay loaded and the switch between them is instantaneous — no unload/reload delay.
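A sketch of that per-request routing in Python (`MODEL_FOR_TASK` and `route` are hypothetical names; the commented-out calls assume a running daemon with both models pulled):

```python
import json
import urllib.request

# Hypothetical routing table: tiny model for quick lookups, 7B for heavy work
MODEL_FOR_TASK = {"quick": "llama3.2:1b", "deep": "qwen2.5:7b"}

def route(prompt, task="quick", url="http://localhost:11434/api/generate"):
    """Pick a model per request; Ollama serves whichever model the body names."""
    body = json.dumps({"model": MODEL_FOR_TASK[task],
                       "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# route("What does HTTP 429 mean?", task="quick")
# route("Analyze this error and suggest a fix...", task="deep")
```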
Current Popular Models Worth Trying
All of these are 4-bit quantized by default and pull directly from Ollama's library:
ollama pull llama3.2:1b # 1.3GB — instant speed, simple tasks
ollama pull llama3.2:3b # 2GB — small and surprisingly capable
ollama pull mistral:latest # 4.1GB — proven reasoning + code workhorse
ollama pull qwen2.5:7b # 4.7GB — best all-rounder, 128K context
ollama pull phi4 # 9.1GB — Microsoft's reasoning model, needs 16GB RAM
ollama pull gemma3:12b # 8.1GB — Google's efficient model, needs 16GB RAM
Start with llama3.2:3b and qwen2.5:7b. That covers 90% of use cases.
Add phi4 or gemma3:12b if you have 16GB+ and want higher quality on complex reasoning tasks.
Disk Space Management
Models live in ~/.ollama/models. Check your usage:
du -sh ~/.ollama/models
To remove a model and free disk space:
ollama rm llama3.2:1b
Re-pull anytime with ollama pull. One caveat: ollama rm deletes the model's layers from disk, so pulling it again later is a full download — Ollama's layer caching only saves bandwidth when another installed model still shares layers with the one you're pulling.
Memory Limits (Hardware Dependent)
Ollama loads models on-demand into RAM. Here's what you need to know:
- One 7B model (q4_K_M): ~5–6GB RAM (plus OS overhead, so ~8GB total consumed)
- One 3B model: ~2.5GB RAM — very comfortable on 8GB systems
- Two 7B models loaded: ~12GB — comfortable on 16GB systems
- One 14B model (phi4): ~9GB — needs 16GB, leaves headroom
Ollama gracefully unloads models you're not using when memory gets tight. With OLLAMA_KEEP_ALIVE set, models stay warm as long as you have the RAM to support them.
That's model management: pull, list, switch, check what's loaded with ollama ps, and tune keep-alive to match your hardware. Next: integrating Ollama with OpenClaw.
Where to Go From Here
You have a working local LLM. You understand how to download models, run them interactively, access them via API, and manage multiple models. That's the foundation. Now: what's next?
Option 1: Dive Into Tutorial 2 (Advanced)
Ready to squeeze maximum performance and integrate with OpenClaw? That's the Advanced tutorial.
You'll learn:
- Parameter Tuning: Temperature, top_p, context windows—control how the model behaves
- Benchmarking: Measure and compare model quality and speed objectively
- GPU Detection: Understand when and how to use GPU acceleration (spoiler: you don't need it)
- OpenClaw Integration: Connect Ollama to your OpenClaw bot (local AI brain!)
- Optimization: Squeeze the last bit of performance from your CPU with tuning and hardware-specific optimizations
Advanced is 90 minutes. Do it when you're ready to move beyond "just works" to "optimized."
Option 2: Integrate With OpenClaw Now
If you want to skip advanced tuning and go straight to using Ollama with OpenClaw, you can do that. You have everything you need:
- ✓ Ollama running (listening on localhost:11434)
- ✓ Model loaded (Mistral or Llama 3.2)
- ✓ REST API working (tested with curl)
Point OpenClaw to your local Ollama API instead of OpenAI. Your bot will have a local LLM brain. See the Advanced tutorial for integration details when you're ready.
Option 3: Experiment on Your Own
Try different models. Compare them. Run them with different parameters. Write scripts to benchmark. Build a tiny chatbot using Python + Ollama API.
Some ideas:
- Write a Python script that sends prompts to Ollama and logs response times
- Compare Mistral vs Llama 3.2 on the same prompts (which is smarter?)
- Build a simple CLI chatbot using the ollama Python package or the requests library
- Experiment with the streaming API to build a live chat interface
- Download a 13B model and benchmark it (for curiosity, even if slow)
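The first idea on that list can be sketched in a few lines of standard-library Python (`benchmark` is a hypothetical helper; the commented-out call assumes a running daemon):

```python
import json
import time
import urllib.request

def benchmark(prompts, model="mistral",
              url="http://localhost:11434/api/generate"):
    """Send each prompt and log wall-clock time plus Ollama's own speed fields."""
    for prompt in prompts:
        body = json.dumps({"model": model, "prompt": prompt,
                           "stream": False}).encode()
        req = urllib.request.Request(url, data=body,
                                     headers={"Content-Type": "application/json"})
        start = time.monotonic()
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        elapsed = time.monotonic() - start
        # eval_duration is nanoseconds, so divide by 1e9 for seconds
        tps = data["eval_count"] / (data["eval_duration"] / 1e9)
        print(f"{prompt[:40]!r}: {elapsed:.1f}s wall, {tps:.1f} tokens/sec")

# benchmark(["Why is the sky blue?", "Tell me a joke"])
```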
Option 4: Learn More (Self-Study)
If you want to deepen your understanding:
- Ollama GitHub: source code, issues, community discussions
- Hugging Face Model Hub: thousands of models, descriptions, benchmarks
- YouTube tutorials: video walkthroughs from the community
- Transformers (deep dive): if you want to understand the math behind LLMs
Common Next Questions
Q: Can I run this in the background permanently?
A: Yes! Ollama already auto-starts as a systemd service. It'll run on boot and keep running.
Q: Can I access Ollama from other machines (not localhost)?
A: Not by default (security). You'd need to expose the API with an nginx proxy or SSH tunnel (advanced).
Q: What if I want to upgrade to a bigger model later?
A: You can, but 13B+ models are significantly slower on CPU. Consider upgrading hardware to GPU.
Q: Is there a web UI for Ollama?
A: Not built-in, but the community has built web interfaces. Or use the CLI/API directly.
Q: Can I use Ollama on Windows/Mac?
A: Yes! Ollama has installers for all platforms. Same concepts apply.
You're Ready
You've completed Ollama Basics. You understand:
- ✓ What Ollama is and why it matters
- ✓ How to install and verify it works
- ✓ How to download and run models
- ✓ How to use the REST API programmatically
- ✓ What performance to expect from modern CPUs
- ✓ How to manage multiple models
That's a solid foundation. From here, you can:
- → Go to Tutorial 2 (Advanced) for optimization and integration
- → Integrate with OpenClaw directly (you have all the pieces)
- → Experiment and build on your own
Whatever you choose, you're now part of the local AI revolution. Your data is yours. Your LLM is yours. No cloud, no subscriptions, just you and your hardware. That's powerful. Enjoy!