Deploying Ollama Server
Learn how to deploy an Ollama server on Spheron with the ability to load any model from the Ollama registry.
Prerequisites
- Spheron account
- Basic knowledge of OpenAI API & SDK
- An idea of which Ollama models you want to use; if not, you can search for models on ollama.com
Deployment Steps
Access Spheron Console
- Navigate to console.spheron.network
- Log in to your account
- If you are new to Spheron, you should already have a $20 free credit balance. If not, please reach out to us on Discord to get free credits.
Select a GPU
- Go to Marketplace tab
- You have 2 options to choose from:
- Secure: For deploying on secure, data-center-grade providers. Highly reliable, but costs more.
- Community: For deploying on community fizz nodes running on someone's home machine. These may be less reliable.
- Now select the GPU you want to deploy on. You can also search by GPU name to find the exact GPU you need.
Configure the Deployment
- Select the template Ollama Server
- Enter the name of any model you want to preload on the Ollama Server in the OLLAMA_MODEL field (see the example after this list). If you don't know which model to use, you can search for models on ollama.com
- If you want, you can increase the GPU count to access multiple GPUs at once.
- You can select the duration of the deployment.
- Click the Confirm button to start the deployment
- Deployment typically completes in under 60 seconds
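For example, if you wanted to preload Llama 3 from the Ollama registry, the OLLAMA_MODEL field would be set like this (the model tag here is illustrative, not required):

OLLAMA_MODEL=llama3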
Using Your Ollama Server
- Once deployed, go to the Overview tab.
- Click the ollama-test service to open the Ollama Server service.
- Use the connection URL to access the Ollama Server and its API. For example, to list the models currently available on the server:
curl http://your-deployment-url/api/tags
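You can also load additional models from the Ollama registry onto the running server by calling the pull endpoint. This is a minimal sketch assuming the standard Ollama /api/pull endpoint and an illustrative model name (llama3):

curl -X POST http://your-deployment-url/api/pull -d '{"model": "llama3"}'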
Model Management on Ollama Server
- List available models:
curl http://your-deployment-url/api/tags
- Run inference:
curl -X POST http://your-deployment-url/api/generate -d '{"model": "llama2", "prompt": "Hello!"}'
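Note that /api/generate streams its response line by line by default; include "stream": false in the request body if you want a single JSON response. If you prefer the OpenAI API & SDK mentioned in the prerequisites, Ollama also exposes an OpenAI-compatible endpoint; the following is a sketch assuming the default /v1 path and the illustrative llama2 model:

curl http://your-deployment-url/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama2", "messages": [{"role": "user", "content": "Hello!"}]}'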
Best Practices
- Start with smaller models before moving to larger ones
- Consider GPU instances for better performance and faster inference