
Deploying Ollama Server

Learn how to deploy an Ollama server on Spheron that can load any model from the Ollama registry.

Prerequisites

  • Spheron account
  • Basic knowledge of the OpenAI API & SDK
  • An idea of which Ollama models you want to use; if not, you can browse models on ollama.com

Deployment Steps

Access Spheron Console

  1. Navigate to console.spheron.network
  2. Log in to your account
  3. If you are new to Spheron, you should already have $20 in free credits. If not, reach out to us on Discord to request free credits.

Select a GPU

  1. Go to Marketplace tab
  2. You have 2 options to choose from:
    • Secure: Deploys on secure, data-center-grade providers. Highly reliable, but costs more.
    • Community: Deploys on community fizz nodes running on someone's home machine. These may be less reliable.
  3. Select the GPU you want to deploy on. You can also search by GPU name to find the exact GPU you are looking for.

Configure the Deployment

  1. Select the template Ollama Server
  2. Enter the name of any model you want to preload on the Ollama server in the OLLAMA_MODEL field (see the example values after this list). If you don't know which model to use, you can search for models on ollama.com
  3. Optionally, increase the GPU count to access multiple GPUs at once.
  4. Select the duration of the deployment.
  5. Click the Confirm button to start the deployment.
  6. Deployment typically completes in under 60 seconds.
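
The OLLAMA_MODEL field accepts a model tag from the Ollama registry, using the same name:tag format as the ollama CLI. The values below are illustrative examples, not requirements:

OLLAMA_MODEL=llama2           # default tag of a model
OLLAMA_MODEL=mistral:7b       # explicit parameter-size tag
OLLAMA_MODEL=codellama:13b    # larger model; make sure the selected GPU has enough VRAM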

Using Your Ollama Server

  1. Once deployed, go to the Overview tab.
  2. Click on the ollama-test service to open the Ollama Server.
  3. Use the connection URL to access the server.
  4. You can also use the connection URL to load (pull) additional models through the Ollama API:
curl -X POST http://your-deployment-url/api/pull -d '{"model": "llama2"}'
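
If you just want to confirm the server is reachable (your-deployment-url here stands in for the connection URL shown in the console), the server root responds with a plain-text status:

# Expected response: "Ollama is running"
curl http://your-deployment-url/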

Model Management on Ollama Server

  • List available models:
curl http://your-deployment-url/api/tags
  • Run inference:
curl -X POST http://your-deployment-url/api/generate -d '{"model": "llama2", "prompt": "Hello!"}'
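
Note that /api/generate streams newline-delimited JSON by default; add "stream": false to the request body if you prefer a single JSON object.

Recent Ollama releases also expose an OpenAI-compatible API under /v1, which is handy if you already use the OpenAI SDK (as noted in the prerequisites). A minimal sketch, assuming the llama2 model has already been pulled:

curl http://your-deployment-url/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama2", "messages": [{"role": "user", "content": "Hello!"}]}'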

Best Practices

  • Start with smaller models before moving to larger ones
  • Consider GPU instances for better performance and faster inference
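
When experimenting with model sizes, it can help to see which models are currently loaded into memory. Newer Ollama versions expose a list-running-models endpoint for this; the example below assumes your deployed version supports it:

# Lists models currently loaded in memory (supported in newer Ollama versions)
curl http://your-deployment-url/api/ps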