Adding Nodes to a Spheron Provider

This document provides step-by-step instructions for adding both CPU and GPU nodes to a Spheron cluster. It covers preparation, script modifications, and the commands needed to integrate the new nodes into the cluster.

  1. Adding a CPU Node
  2. Adding a GPU Node

Adding a CPU Node

To add a CPU node to your Spheron cluster, follow these steps:

Preparing a Node for Installation

1. Prepare the node for installation using Ansible, as described in the earlier step.

2. Clone the Provider Deployment Repository

If you haven't already, clone the Spheron provider deployment repository: spheronFdn/provider-deployment

git clone https://github.com/spheronFdn/provider-deployment.git

The repo has the following file structure:

[Image: Provider Deployment Repo file structure]

3. Edit the Inventory File

Open playbook/inventory.ini and update it with your server details. Ensure the server IP, username, and SSH key are set correctly. Example:

[testnet] 
server-name ansible_host=23.158.40.38 ansible_user=root ansible_ssh_private_key_file=~/.ssh/id_rsa
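
Optionally, before running the playbook, you can confirm Ansible can reach the server (a quick check using Ansible's built-in ping module against the testnet group from the inventory above; run it from the repo root):

ansible -i playbook/inventory.ini testnet -m ping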

4. Execute the Ansible Playbook

cd playbook
ansible-playbook -i inventory.ini playbook.yml
  • The server will restart after the playbook finishes. When prompted whether this is the first node in the cluster, answer No.
  • Follow the prompt like this:

[Image: playbook prompt asking about the first node]

5. SSH into the Master Node

SSH into the first node (master node) of the cluster and follow the steps:

  • Add the node using the add-agent.sh script by running the following commands:
sudo su spheron
cd
wget -q https://raw.githubusercontent.com/spheronFdn/provider-deployment/main/scripts/add-agent.sh
  • Use Vim or Nano to edit the add-agent.sh script. Update the master node IP and add child node IPs:
vim add-agent.sh

Edit the following lines:

SPHERON_NODE1_IP=134.195.196.81 # your master node
 
# all your child nodes
nodes=(
    ["spheron-node2"]="134.195.196.213" ## add nodes like this to the list and change the node name if you want
)
  • Run the Script on the Master Node
sudo chmod +x add-agent.sh
./add-agent.sh
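
After the script finishes, you can confirm the new node registered with the cluster (standard kubectl, run on the master node; the node name should match the entry you added to the nodes array above):

kubectl get nodes -o wide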

Adding a GPU Node

To add a GPU node to your Spheron cluster, follow these steps:

Preparing a Node for Installation

1. Prepare the node for installation using Ansible, as described in the earlier step.

2. Clone the Provider Deployment Repository

If you haven't already, clone the Spheron provider deployment repository: spheronFdn/provider-deployment

git clone https://github.com/spheronFdn/provider-deployment.git

The repo has the following file structure:

[Image: Provider Deployment Repo file structure]

3. Edit the Inventory File

Open playbook/inventory.ini and update it with your server details. Ensure the server IP, username, and SSH key are set correctly. Example:

[testnet] 
server-name ansible_host=23.158.40.38 ansible_user=root ansible_ssh_private_key_file=~/.ssh/id_rsa

4. Execute the Ansible Playbook

cd playbook
ansible-playbook -i inventory.ini playbook.yml
  • The server will restart after the playbook finishes. When prompted whether this is the first node in the cluster, answer No.
  • Follow the prompt like this:

[Image: playbook prompt asking about the first node]

The playbook will install the GPU drivers and some supporting scripts.
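
Once the playbook completes, you can confirm the driver is working by running nvidia-smi on the GPU node (it ships with the NVIDIA driver and lists the detected GPUs):

nvidia-smi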

5. SSH into the Master Node

SSH into the first node (master node) of the cluster and follow the steps:

  • Add the node using the add-agent.sh script by running the following commands:
sudo su spheron
cd
wget -q https://raw.githubusercontent.com/spheronFdn/provider-deployment/main/scripts/add-agent.sh
  • Use Vim or Nano to edit the add-agent.sh script. Update the master node IP and add child node IPs:
vim add-agent.sh
  • Edit the following lines:
SPHERON_NODE1_IP=134.195.196.81 # your master node
 
# all your child nodes
nodes=(
    ["spheron-node2"]="134.195.196.213" ## add nodes like this to the list and change the node name if you want
)
  • Run the Script on the Master Node
sudo chmod +x add-agent.sh
./add-agent.sh

Install the Nvidia Device Plugin (First GPU Node Only)

⚠️

NOTE: Only execute the two steps in this section if this is the first GPU node in your cluster.

  1. Create the Nvidia RuntimeClass manifest as the root user (the next step applies this file):
sudo su
# Create NVIDIA RuntimeClass
cat > /home/spheron/gpu-nvidia-runtime-class.yaml <<EOF
kind: RuntimeClass
apiVersion: node.k8s.io/v1
metadata:
  name: nvidia
handler: nvidia
EOF
sudo su spheron
  2. Apply the RuntimeClass and install the NVIDIA device plugin:
cd
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin

helm repo update

kubectl apply -f /home/spheron/gpu-nvidia-runtime-class.yaml

helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.14.5 \
  --set runtimeClassName="nvidia"
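
Once the device plugin is installed, you can check that its pod is running (standard kubectl; the namespace matches the helm install above):

kubectl get pods -n nvidia-device-plugin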

Configure the New Node

⚠️

NOTE: Execute this on the new node that has the GPU in it by SSHing into it.

  1. SSH into the new node and check whether the file /etc/rancher/k3/config.yaml exists:
cat /etc/rancher/k3/config.yaml
  2. If the above command doesn't show any output, create the file with the following command:
cat > /etc/rancher/k3/config.yaml <<'EOF'
containerd_additional_runtimes:
  - name: nvidia
    type: "io.containerd.runc.v2"
    engine: ""
    root: ""
    options:
      BinaryName: '/usr/bin/nvidia-container-runtime'
EOF
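
The config above points containerd at the NVIDIA container runtime binary; you can confirm it was installed by the earlier playbook with a simple existence check (the path is taken from the BinaryName option above):

ls -l /usr/bin/nvidia-container-runtime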

Create a GPU Test Pod (First GPU Node Only)

⚠️

NOTE:

  • Only execute the steps in this section if this is the first GPU node in your cluster.
  • SSH back into the master node for the next steps.

On the master node, create a GPU test pod to check that the GPU is configured correctly with Kubernetes.

cat > gpu-test-pod.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: nbody-gpu-benchmark
  namespace: default
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody
    args: ["nbody", "-gpu", "-benchmark"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      value: all
    - name: NVIDIA_DRIVER_CAPABILITIES
      value: all
EOF
kubectl apply -f gpu-test-pod.yaml
echo "Waiting 60 seconds for the test pod to start..."
sleep 60
kubectl get pods -A -o wide
kubectl logs nbody-gpu-benchmark
kubectl delete pod nbody-gpu-benchmark
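
If the pod stays in Pending or the logs are empty, describing the pod (before deleting it) usually reveals the reason, such as no node advertising an allocatable nvidia.com/gpu (standard kubectl):

kubectl describe pod nbody-gpu-benchmark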

Update Provider Capabilities

On the new GPU node, update the provider configuration to add your GPU / CPU hardware and units:

  1. Open the configuration file in a text editor:
vi /home/spheron/.spheron/provider-config.json

Note: Refer to Provider Configuration for the configuration structure, and update the file based on it.

  2. Update the provider configuration in the provider-config.json file and save it.

  3. Update the config onchain by running the commands below:

sphnctl wallet use --name wallet --key-secret testPassword
sphnctl provider update --config /home/spheron/.spheron/provider-config.json
  4. Set the provider GPU & CPU attributes for Spheron liveness rewards:
sphnctl provider set-attribute --config ~/.spheron/provider-config.json

Note: If you get an RPC error while running these commands, retry them or reach out to the team with the issue.

Restart the Provider

Restart the provider on the new GPU node to apply the new capabilities:

kubectl rollout restart statefulset/spheron-provider -n spheron-services
kubectl rollout restart deployment/operator-inventory -n spheron-services   
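
You can wait for the restarts to finish rolling out (kubectl rollout status blocks until the new pods are ready; same resource names as above):

kubectl rollout status statefulset/spheron-provider -n spheron-services
kubectl rollout status deployment/operator-inventory -n spheron-services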

Verify the node has GPU labels

Note: To find the node name for this step, run the first command below and take the first name in the list:

kubectl get nodes
kubectl describe node [Node Name] | grep -A10 Labels
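
Beyond labels, you can also confirm the GPU is schedulable by checking the node's allocatable resources (standard kubectl; a correctly configured GPU node should report a nonzero nvidia.com/gpu count):

kubectl describe node [Node Name] | grep -A10 Allocatable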

Your provider should now have the new GPU or CPU hardware working 🚀🚀🚀
