How to Run Local LLMs with Ollama

Running local LLMs with Ollama is an exciting way to bring powerful AI models directly onto your own hardware, enhancing privacy, reducing network dependency, and lowering development costs.

🚀 Why Run Local LLMs?

  • Privacy: Your data stays entirely on your local machine, crucial for sensitive information.
  • Speed and availability: No network latency and no dependence on a remote service staying up.
  • Cost: No per-token API charges; you run models on hardware you already own.

🛠 Getting Started with Ollama

Download

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Visit the Ollama download page and download the installer (.exe).

macOS: Visit the Ollama download page and download the installer (.app).
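
Once installed, a quick way to confirm the local server is running is to hit its default endpoint from Python (a minimal sketch, assuming the default port 11434 and the requests library):

import requests

# Quick check that the Ollama server is listening on its default port
try:
    r = requests.get("http://localhost:11434")
    print(r.status_code, r.text)  # typically: 200 Ollama is running
except requests.exceptions.ConnectionError:
    print("Ollama isn't reachable yet - try starting it with `ollama serve`.")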

Pulling Models

To download a model:

ollama pull <MODEL>

Example with Llama2:

ollama pull llama2
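
You can check which models are available locally with `ollama list`, or query the local REST API for the same information (a small sketch, assuming the default port and the /api/tags endpoint):

import requests

# List the models already downloaded to this machine
tags = requests.get("http://localhost:11434/api/tags").json()
for model in tags.get("models", []):
    print(model["name"])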

Running Models

Run a model directly from your terminal:

ollama run <MODEL>

Example:

ollama run llama2

Interact with your model by typing prompts. To exit, simply type:

/bye

Saving and Loading Chats

Ollama provides simple commands to save and load chat histories. You can save a chat session directly from your terminal by running the following command while in a chat:

/save chat-session-name

Later, load the saved session to continue your conversation:

/load chat-session-name

This makes it easy to pick up conversations right where you left off.

✨ Using the Ollama API

Create a simple terminal-based chat application:

import json
import requests

# Initialize chat history
messages = [{"role": "system", "content": "You're my assistant named Jarvis. I'm Tony"}]

while True:
    try:
        user_input = input("\nYou: ")
    except KeyboardInterrupt:
        print("\nGoodbye!")
        break

    if user_input.lower() in ["exit", "quit"]:
        print("Ending chat...")
        break

    # Add user message to history
    messages.append({"role": "user", "content": user_input})

    # Stream the assistant's response
    print("Jarvis: ", end="", flush=True)
    with requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama2",
            "prompt": "\n".join([f"{m['role']}: {m['content']}" for m in messages]),
            "stream": True,
        },
        stream=True,
    ) as response:
        response.raise_for_status()
        full_response = ""
        for line in response.iter_lines():
            if line:
                token = json.loads(line.decode("utf-8"))
                chunk = token.get("response", "")
                if chunk:
                    full_response += chunk
                    print(chunk, end="", flush=True)
                if token.get("done"):
                    break

    # Add assistant response to history
    messages.append({"role": "assistant", "content": full_response})

Example Output:

You: Hello Jarvis
Jarvis: Hello, Mr. Stark! *adjusts glasses* It's a pleasure to assist you as always. Is there something you need help with today?
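
The example above builds one prompt string for the /api/generate endpoint. Recent Ollama versions also expose a /api/chat endpoint that accepts the message list directly; here is a minimal sketch under that assumption:

import json
import requests

messages = [
    {"role": "system", "content": "You're my assistant named Jarvis. I'm Tony"},
    {"role": "user", "content": "Hello Jarvis"},
]

# Stream the reply from /api/chat, which takes the messages list as-is
with requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama2", "messages": messages, "stream": True},
    stream=True,
) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        if not line:
            continue
        token = json.loads(line.decode("utf-8"))
        print(token.get("message", {}).get("content", ""), end="", flush=True)
        if token.get("done"):
            print()
            break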

🐍 Using the Ollama Python package:

Note: Before running the Python script below, ensure you’ve installed the Ollama Python package:

pip install ollama

import ollama

MODEL_NAME = "llama2"
ollama.pull(MODEL_NAME)

SYSTEM_PROMPT = "You're my assistant named Jarvis. I'm Tony"
USER_PROMPT = "Hello, how are you?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]


print("Jarvis: ", end="", flush=True)

response = ollama.chat(model=MODEL_NAME, messages=messages, stream=True)

# Print response chunks progressively
full_response = []
for chunk in response:
    content = chunk.get("message", {}).get("content", "")
    if content:
        print(content, end="", flush=True)
        full_response.append(content)

print("\n")  # Add final newlines

Example Output:

Jarvis:
Ah, good day to you, sir! *adjusts monocle* I am functioning within normal parameters, thank you for inquiring. How may I assist you today? Is there something in particular you require, or perhaps a task you'd like me to attend to? 😊
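
If you don’t need token-by-token streaming, the same ollama.chat call can be made without stream=True, in which case the full reply comes back in one response object (a minimal sketch reusing the prompts from above):

import ollama

# Non-streaming variant: the whole assistant reply is returned at once
response = ollama.chat(
    model="llama2",
    messages=[
        {"role": "system", "content": "You're my assistant named Jarvis. I'm Tony"},
        {"role": "user", "content": "Hello, how are you?"},
    ],
)
print(response["message"]["content"])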

🔁 Using Ollama with OpenAI compatibility:

Ollama exposes an OpenAI-compatible API, so making it work with the OpenAI format is simple: you just need to initialize your OpenAI client to point at the local Ollama server.

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required, but unused
)

response = client.chat.completions.create(
  model="llama2",
  messages=[
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant", "content": "..."},
  ]
)

print(response.choices[0].message.content)

This can be useful for testing your code workflow and OpenAI calls without incurring charges.
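
The compatibility layer also supports streaming through the standard OpenAI client interface; a small sketch, assuming the same local client configuration as above:

from openai import OpenAI

# Same local, OpenAI-compatible endpoint as above
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# Stream the reply; chunks arrive as deltas, just like with the hosted API
stream = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Tell me a short joke."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()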

🧠 Advanced Use: Combine Local and Cloud Models

You can leverage advanced reasoning models like DeepSeek to perform complex tasks and even orchestrate calls to more powerful external models, such as GPT-4. This approach combines the speed and privacy of local models with the advanced capabilities of external APIs.

Pull a reasoning model with:

ollama pull deepseek-r1:7b

Here’s an example that uses DeepSeek locally to produce reasoning steps, then sends them to GPT-4 for tasks that require more advanced elaboration:

import ollama
from openai import OpenAI

# Get the reasoning steps from the local model
ollama.pull("deepseek-r1:7b")
reasoning_response = ollama.generate(
    model="deepseek-r1:7b",
    prompt="Analyze and explain the core reasoning steps for: [Your Complex Issue Here]",
)

# Keep only the reasoning inside DeepSeek-R1's <think> block
reasoning_text = (
    reasoning_response["response"].split("</think>")[0].replace("<think>", "").strip()
)

# Hand the reasoning to GPT-4 for expansion (reads OPENAI_API_KEY from the environment)
client = OpenAI()
api_response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Expand only on these reasoning elements:"},
        {"role": "user", "content": reasoning_text},
    ],
)

print(api_response.choices[0].message.content)

This approach optimally leverages local and remote AI resources.

✨ Wrap-Up

Whether you’re prototyping, building products, or just experimenting, Ollama makes it simple and powerful. If you build something cool, let me know—I’d love to check it out!