How to Run Local LLMs with Ollama
Running local LLMs with Ollama is an exciting way to bring powerful AI models directly to your own hardware, enhancing privacy, removing network dependency, and reducing development costs.
🚀 Why Run Local LLMs?
- Privacy: Your data stays entirely on your local machine, crucial for sensitive information.
- Speed and availability: No network latency, and no dependence on a remote service being up.
- Pricing: No per-request API fees, which keeps development and experimentation costs low.
🛠 Getting Started with Ollama
Download
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Visit the Ollama download page and download the installer (.exe).
macOS: Visit the Ollama download page and download the installer (.app).
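Once installed, you can sanity-check the setup by listing the models available on your machine (the list will be empty on a fresh install):
ollama list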
Pulling Models
To download a model:
ollama pull <MODEL>
Example with Llama2:
ollama pull llama2
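Most models are published in several sizes. Assuming the tag exists in the Ollama library, you can pull a specific variant by appending it to the model name, for example:
ollama pull llama2:13b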
Running Models
Run a model directly from your terminal:
ollama run <MODEL>
Example:
ollama run llama2
Interact with your model by typing prompts. To exit, simply type:
/bye
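You can also pass a prompt directly on the command line to get a one-off answer without starting an interactive session, for example:
ollama run llama2 "Why is the sky blue?"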
Saving and Loading Chats
Ollama provides simple commands to save and load chat histories. While inside a chat, you can save the session directly from your terminal by running:
/save chat-session-name
Later, load the saved session to continue your conversation:
/load chat-session-name
This makes it easy to pick up conversations right where you left off.
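Inside an interactive session, you should also be able to see the full list of available slash commands at any time with:
/?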
✨ Using the Ollama API
Ollama exposes a local HTTP API on port 11434. Here's a simple terminal-based chat application that streams responses from it:
import json
import requests

# Initialize chat history
messages = [{"role": "system", "content": "You're my assistant named Jarvis. I'm Tony"}]

while True:
    try:
        user_input = input("\nYou: ")
    except KeyboardInterrupt:
        print("\nGoodbye!")
        break

    if user_input.lower() in ["exit", "quit"]:
        print("Ending chat...")
        break

    # Add user message to history
    messages.append({"role": "user", "content": user_input})

    # Stream the assistant's response
    print("Jarvis: ", end="", flush=True)
    with requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama2",
            "prompt": "\n".join([f"{m['role']}: {m['content']}" for m in messages]),
            "stream": True,
        },
        stream=True,
    ) as response:
        response.raise_for_status()
        full_response = ""
        for line in response.iter_lines():
            if line:
                token = json.loads(line.decode("utf-8"))
                chunk = token.get("response", "")
                if chunk:
                    full_response += chunk
                    print(chunk, end="", flush=True)
                if token.get("done"):
                    break

    # Add assistant response to history
    messages.append({"role": "assistant", "content": full_response})
Example Output:
You: Hello Jarvis
Jarvis: Hello, Mr. Stark! *adjusts glasses* It's a pleasure to assist you as always. Is there something you need help with today?
You:
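The script above flattens the chat history into a single prompt for the /api/generate endpoint. Ollama also provides a /api/chat endpoint that accepts role-tagged messages directly, so you don't have to build the prompt yourself. Here is a minimal non-streaming sketch, reusing the same model and prompts as above:
import requests

# Minimal sketch of Ollama's /api/chat endpoint (non-streaming).
# Model name and prompts match the example above -- adjust as needed.
payload = {
    "model": "llama2",
    "messages": [
        {"role": "system", "content": "You're my assistant named Jarvis. I'm Tony"},
        {"role": "user", "content": "Hello Jarvis"},
    ],
    "stream": False,  # return the whole reply as one JSON object
}

response = requests.post("http://localhost:11434/api/chat", json=payload)
response.raise_for_status()

# The assistant's reply lives under message -> content
print(response.json()["message"]["content"])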
🐍 Using the Ollama Python Package
Note: Before running the Python script below, ensure you've installed the Ollama Python package:
pip install ollama
import ollama

MODEL_NAME = "llama2"
ollama.pull(MODEL_NAME)

SYSTEM_PROMPT = "You're my assistant named Jarvis. I'm Tony"
USER_PROMPT = "Hello, how are you?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

print("Jarvis: ", end="", flush=True)
response = ollama.chat(model=MODEL_NAME, messages=messages, stream=True)

# Print response chunks progressively
full_response = []
for chunk in response:
    content = chunk.get("message", {}).get("content", "")
    if content:
        print(content, end="", flush=True)
        full_response.append(content)

print("\n")  # Add final newlines
Example Output:
Jarvis:
Ah, good day to you, sir! *adjusts monocle* I am functioning within normal parameters, thank you for inquiring. How may I assist you today? Is there something in particular you require, or perhaps a task you'd like me to attend to? 😊
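If you don't need token-by-token output, you can also call ollama.chat without streaming and read the complete reply in one go. A minimal sketch, reusing the same model and prompts:
import ollama

# Non-streaming variant: wait for the complete reply, then print it.
response = ollama.chat(
    model="llama2",
    messages=[
        {"role": "system", "content": "You're my assistant named Jarvis. I'm Tony"},
        {"role": "user", "content": "Hello, how are you?"},
    ],
)

# The full reply is available under message -> content
print(response["message"]["content"])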
🔁 Using Ollama with OpenAI Compatibility
Ollama exposes an OpenAI-compatible endpoint, so making it work with the OpenAI format is simple: just initialize your OpenAI client to point at the local Ollama server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but unused by Ollama
)

response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ],
)

print(response.choices[0].message.content)
This is useful for testing your code and OpenAI call workflow without incurring API charges.
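The same compatibility layer should also work with the standard OpenAI streaming interface. A minimal sketch, assuming the llama2 model pulled earlier:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but unused by Ollama
)

# Request a streamed completion, exactly as you would against the OpenAI API
stream = client.chat.completions.create(
    model="llama2",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    stream=True,
)

# Print each chunk's text as it arrives
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()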
🧠 Advanced Use: Combine Local and Cloud Models
You can leverage advanced reasoning models like DeepSeek to perform complex tasks and even orchestrate calls to more powerful external models, such as GPT-4. This approach allows combining the speed and privacy of local models with the advanced capabilities of external APIs.
Pull a reasoning model with:
ollama pull deepseek-r1:7b
Here's an example that uses DeepSeek locally for the reasoning step and then passes that reasoning to GPT-4 via the OpenAI API:
import ollama
from openai import OpenAI

# Get the reasoning steps from the local model
ollama.pull("deepseek-r1:7b")
reasoning_response = ollama.generate(
    model="deepseek-r1:7b",
    prompt="Analyze and explain the core reasoning steps for: [Your Complex Issue Here]",
)

# deepseek-r1 wraps its chain of thought in <think>...</think>; keep only that part
reasoning_text = (
    reasoning_response["response"].split("</think>")[0].replace("<think>", "").strip()
)

# Hand the local model's reasoning to GPT-4 for the final answer
client = OpenAI()  # reads OPENAI_API_KEY from the environment
api_response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Expand only on these reasoning elements:"},
        {"role": "user", "content": reasoning_text},
    ],
)

print(api_response.choices[0].message.content)
This approach optimally leverages local and remote AI resources.
✨ Wrap-Up
Whether you’re prototyping, building products, or just experimenting, Ollama makes it simple and powerful. If you build something cool, let me know—I’d love to check it out!