These days, CUDA, llama.cpp, and the optimized GGUF model format are getting more popular. Let us see what we need to run the optimized “Phi-3-mini-128k-instruct” model in GGUF format with llama.cpp on an IBM Cloud Virtual Server Instance with GPUs and an Ubuntu 22.04 operating system.
Note: The 128k in the name indicates the model's maximum input token (context) size.
These are the steps we need to take care of:
Step 1: Setup a VSI with a GPU on IBM Cloud
We can follow the steps in the linked blog post "How do you initially set up a Virtual Server Instance with a GPU in IBM Cloud?" to set up the Virtual Server Instance.
Step 2: Ensure that the path to the NVIDIA CUDA Compiler Driver NVCC is included in the $PATH variable. The nvcc compiler is part of the nvidia-cuda-toolkit.
If we followed the instructions of the blog post in Step 1, we notice that nvcc is not available on the PATH, even though the nvidia-cuda-toolkit was installed, so we need to find the installation location.
Note: We need the nvcc application later, because it will be invoked by the makefile of llama.cpp.
- Find the nvcc application on our machine with the following command:
find / | grep nvcc
- Output:
...
/usr/local/cuda-12.5/bin/nvcc
...
- Create or edit the .bashrc file:
cd ~/
nano .bashrc
- Insert the following variables into the .bashrc file:
export NVCC_CUDA_TOOLS="/usr/local/cuda-12.5/bin/"
export PATH=$PATH:$NVCC_CUDA_TOOLS
# Define the CUDA_HOME variable
export CUDA_HOME="/usr/local/cuda-12.5/"
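- Reload the configuration so that the new variables take effect in the current shell (or open a new terminal):
source ~/.bashrc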
- Verify that the command is now available:
nvcc
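- Optionally, we can also print the compiler version to confirm which CUDA toolkit is being used (the path above points to CUDA 12.5):
nvcc --version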
Step 3: Ensure we have the latest gcc compiler
Later, we will compile llama.cpp, so we ensure that we have a recent gcc compiler in place.
sudo apt install gcc
sudo apt install --reinstall gcc-12
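We can check the installed compiler version afterwards:
gcc --version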
Step 4: Compile llama.cpp to use the GPUs
This is the most critical step for us, because only with the right compilation can we use the GPUs. The following text is partly extracted from the llama.cpp documentation.
CUDA This provides GPU acceleration using the CUDA cores of your Nvidia GPU. Make sure to have the CUDA toolkit installed.
Here we use the parameter LLAMA_CUDA=1
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUDA=1
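Optionally, instead of the plain make call, we can capture the build output and afterwards search it for nvcc invocations, as a quick check that CUDA code was actually compiled (build.log is just a file name chosen here):
make LLAMA_CUDA=1 2>&1 | tee build.log
grep -i nvcc build.log | head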
Step 5: We create a virtual Python environment
python3 -m venv my-env
source my-env/bin/activate
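To verify that the virtual environment is active, we can check which Python interpreter is used; it should point into the my-env directory:
which python3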
Step 6: We need to ensure that llama-cpp-python is also compiled with CUDA support, matching our llama.cpp build
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
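A quick sanity check for the installation is to ask the low-level bindings whether GPU offload was compiled in. This assumes that the installed llama-cpp-python version exposes the llama_supports_gpu_offload binding:
python3 -c "import llama_cpp; print('GPU offload supported:', llama_cpp.llama_supports_gpu_offload())"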
Step 7: We download the needed model
We ensure that we can use the Huggingface CLI to download the model file.
- Install the CLI
python3 -m pip install huggingface-hub
- Download the model
huggingface-cli download QuantFactory/Phi-3-mini-128k-instruct-GGUF --local-dir . --local-dir-use-symlinks False --include='*Q8_0*gguf'
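- Check that the quantized model file arrived in the current directory; the file name should match the include pattern above:
ls -lh ./*Q8_0*gguf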
Step 8: Example code to run the model
The following code is an example of how to load a prompt from a PDF file and run the model on it.
import os
from llama_cpp import Llama
from PyPDF2 import PdfReader

###########
# Functions
def load_input():
    # Read all pages of the prompt PDF and join them into a single prompt string
    filepath = os.path.abspath("./prompts/prompt.pdf")
    with open(filepath, 'rb') as file:
        pdf_reader = PdfReader(file)
        pages = []
        for i in range(0, len(pdf_reader.pages)):
            pages.append(pdf_reader.pages[i].extract_text())
        prompt = '\n'.join(pages)
    return prompt

###########
# Execution
# 1. Load prompt from pdf
prompt = load_input()

# 2. Load the model gguf file
model_path = "./models/Phi-3-mini-128k-instruct.Q8_0.gguf"
llm = Llama(
    model_path=model_path,
    n_gpu_layers=-1,  # Use GPU acceleration (-1 offloads all layers)
    # seed=1337,      # Uncomment to set a specific seed
    n_ctx=128000,     # Increase the context window (the 128k model supports up to 128k tokens)
    echo=False
)

# 3. Generate an LLM response
output = llm.create_completion(prompt,
                               max_tokens=1000,
                               repeat_penalty=1.2,
                               temperature=0.0
                               )
print("\n*****\nLLM Output:\n", output['choices'][0]['text'])
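The script expects the GGUF file under ./models and the prompt PDF under ./prompts. Since Step 7 downloaded the model into the current directory, one possible way to arrange the files is shown below; the prompt PDF path is only a placeholder for your own input file:
mkdir -p models prompts
mv ./*Q8_0*gguf ./models/
cp /path/to/your/prompt.pdf ./prompts/prompt.pdf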
Step 9: Run the model
- Run the model in the first terminal
source ./my-env/bin/activate
python3 -m pip install huggingface-hub
python3 -m pip install PyPDF2
python3 run_example_code.py
- Monitor the GPU usage in the second terminal
watch -n0.5 nvidia-smi
2. Summary
These are the essential takeaways from my perspective:
- Ensure you use the correct nvcc application version
- Ensure you compile llama.cpp for the right platform
- Ensure you use the correctly compiled version of llama-cpp-python in your Python code
3. Additional resources
The following resources may be helpful in this context.
- Running Mistral on CPU via llama.cpp Blog post from Niklas Heidloff
- Pypi.org Llama cpp Python Library documentation
- Install llama-cpp-python with GPU Support Blog post from Manish Kovelamudi
- How to Install CUDA on Ubuntu 22.04 | Step-by-Step Blog post from Mantas Levinas
I hope this was useful to you, and let’s see what’s next.
Greetings,
Thomas
#python, #cuda, #llamacpp, #ibmcloud, #gpu, #ai
